ABSTRACT


Title  of  Dissertation:  COMMUNICATION-DRIVEN  CODESIGN 

FOR  MULTIPROCESSOR  SYSTEMS 

Neal  Kumar  Bambha,  Doctor  of  Philosophy,  2004 

Dissertation  directed  by:  Professor  Shuvra  S.  Bhattacharyya 

Department  of  Electrical  and  Computer  Engineering 


Several  trends  in  technology  have  important  implications  for  embedded  systems  of 
the  future.  One  trend  is  the  increasing  density  and  number  of  transistors  that  can  be 
placed  on  a  chip.  This  allows  designers  to  fit  more  functionality  into  smaller  devices, 
and to place multiple processing cores on a single chip. Another trend is the increasing
emphasis on low power designs. A third trend is the appearance of bottlenecks in
embedded system designs due to the limitations of long electrical interconnects, and the
increasing use of optical interconnects to overcome these bottlenecks. These trends lead
to  rapidly  increasing  complexity  in  the  design  process,  and  the  necessity  to  develop 
tools  that  automate  the  process.  This  thesis  will  present  techniques  and  algorithms  for 
developing  such  tools. 

Automated techniques are especially important for multiprocessor designs. Programming
such systems is difficult, and this is one reason why they are not more prevalent
today. In this thesis we explore techniques for automating and optimizing the process
of mapping applications onto system architectures containing multiple processors. We
examine different processor interconnection methods and topologies, and the design
implications of different levels of connectivity between the processors.

Using  optics,  it  is  practical  to  construct  processor  interconnections  having  arbitrary 
topologies.  This  can  offer  advantages  over  regular  interconnection  topologies.  However, 
existing  scheduling  techniques  do  not  work  in  general  for  such  arbitrarily  connected 
systems.  We  present  an  algorithm  that  can  be  used  to  supplement  existing  scheduling 
techniques  to  enable  their  use  with  arbitrary  interconnection  patterns. 

We  use  our  scheduling  techniques  to  explore  the  larger  problem  of  synthesizing  an 
optimal  interconnection  network  for  a  problem  or  group  of  problems. 

We examine the problem of optimizing synchronization costs in multiprocessor systems,
and propose new architectures that reduce synchronization costs and permit efficient
performance analysis.

All the trends listed above combine to add dimensions to the already vast design
space for embedded systems. Optimizations in embedded system design invariably reduce
to searching vast design spaces. We describe a new hybrid global/local framework
that combines evolutionary algorithms with problem-specific local search and demonstrate
that it is more efficient in searching these spaces.

Chapter 1


Introduction 


The  semiconductor  industry  has  demonstrated  remarkable  progress  during  the  past  four 
decades. For society, this has meant a continual decrease in the cost of electronic devices,
from computers to mobile phones to consumer electronics, and their increasing
prevalence in our lives. Much of this progress results from the ability to exponentially
decrease minimum feature sizes used to fabricate integrated circuits. The most frequently
cited trend is Moore’s Law, which states that the number of components on a
chip doubles every 18 months. The International Technology Roadmap for Semiconductors
predicts that by the year 2007, it will be possible to place 800 million transistors
in  a  one  square  centimeter  chip.  At  the  same  time,  design  cycle  times  have  decreased, 
and  interconnects  between  processing  elements  are  becoming  an  increasing  bottleneck. 
For  a  system  designer,  the  biggest  challenges  involve  making  effective  use  of  this  huge 
potential  functionality,  and  dealing  with  the  associated  complexity.  In  many  ways,  time 
is a much more precious commodity for designers today than is chip area. For this reason,
tools that automate the design process are essential for the continued progress of
the industry. There has been much research done on lower level design tools which optimize
and produce a physical layout for a circuit that has been described in a sufficient
amount of detail. Less work has been done on tools for automating the higher levels of
design.  This  thesis  will  address  several  issues  relating  to  high  level  design  automation, 
with  a  focus  on  embedded  multiprocessor  systems. 

A central theme in this work is the effect of communication costs and resource contention
across processors in the system. We develop techniques and algorithms to deal
with these effects in systems whose complexity ranges from low cost shared bus systems
to high performance multiprocessor systems utilizing optical interconnects. Communication
and contention effects, along with the nature of the application, influence the type
of  interconnect  that  is  most  effective.  We  discuss  different  interconnection  methods  and 
present  algorithms  for  finding  an  optimal  interconnect  topology. 

All our optimization problems involve searching large, complex design spaces. Indeed,
through our work with a diverse variety of complex optimization problems, we
have  developed  unique  insights  on  general  methods  for  addressing  such  problems.  We 
present  a  broadly-applicable  framework,  which  has  been  derived  from  these  insights,  for 
searching  complex  design  spaces,  and  we  describe  how  our  optimization  problems  can 
be  solved  using  this  framework. 

1.1  Multiprocessor  Embedded  Systems 

An  embedded  system  is  a  combination  of  computing  hardware  and  software  designed  to 
perform  a  dedicated  function.  It  is  usually  part  of  a  larger  system,  such  as  the  processor 
in  a  cell  phone.  By  contrast,  a  general  purpose  computing  system  such  as  a  personal 
computer  is  designed  to  perform  many  functions.  Embedded  systems  typically  offer 
much  higher  performance,  lower  power,  and  lower  cost  for  their  dedicated  function 
than a general purpose system performing the same function. Examples of embedded
systems include consumer devices like MP3 players and cell phones, military radar and
imaging  systems,  and  processors  for  automotive  engine  control. 

The  processing  elements  of  an  embedded  system  perform  two  main  tasks — control 
and  data  stream  processing.  The  control  functionality  consists  of  choosing  between 
modes  of  operation  for  the  device,  based  on  inputs  and  state  information.  For  example,  a 
simple  controller  chip  on  a  microwave  oven  controls  the  power  level  and  starts  and  stops 
the  oven  based  on  the  keypad  inputs.  Data  stream  processing,  or  digital  signal  processing 
(DSP),  is  required  in  devices  such  as  cell  phones,  which  must  sample  data  from  the  radio 
receiver  and  convert  it  into  a  digital  data  stream  using  algorithms  which  might  decrypt 
the  signal  and  correct  for  reception  errors.  In  this  thesis  we  will  focus  on  developing 
tools  that  optimize  the  signal  processing  (DSP)  functionality  of  a  system.  Processors 
with  architectures  that  are  optimized  to  provide  very  powerful  digital  signal  processing 
functionality are inexpensive, readily available, and prevalent in modern devices.

Applications  like  video  processing  and  automated  target  recognition  are  extremely 
computationally  intensive,  and  require  this  processing  to  be  performed  in  real  time. 
One  way  to  meet  these  requirements  is  to  design  very  large  scale  integrated  (VLSI) 
application-specific  integrated  circuits  (ASIC)  that  are  customized  for  the  specific  task. 
The  main  problem  with  this  approach  is  the  long  design  cycle,  and  the  fact  that  the 
design  is  not  flexible — if  there  are  changes  to  the  specifications,  a  new  set  of  ASICs 
must  be  designed  and  tested.  Programmable  solutions,  by  contrast,  allow  changes  to  be 
made  late  in  the  design  cycle  by  rewriting  the  software.  The  use  of  standard  processing 
cores  that  have  been  verified  for  correctness  eliminates  much  of  the  error-prone  testing 
and  debugging  associated  with  ASIC  design.  However,  it  is  often  the  case  that  a  single, 
standard  DSP  chip  cannot  deliver  the  performance  required  from  the  application.  In 
these cases, one attractive solution is to utilize multiple processors. Manufacturers today
are able to place several processors on a single die. As the transistor count continues to
increase  this  becomes  more  cost  effective,  since  it  is  less  expensive  to  verify  and  test  a 
number  of  smaller,  standard  processing  elements  than  to  test  a  larger,  more  complicated 
design.  This  will  make  multiprocessor  design  increasingly  important  in  the  future.  One 
trade-off  that  comes  with  using  multiple  processors  is  that  programming  them  is  more 
complex,  since  it  is  necessary  to  deal  with  issues  such  as  synchronization,  deadlock, 
interconnect  architecture,  and  interprocessor  communication  costs.  Software  tools  are 
needed  that  allow  the  designer  to  specify  an  application  at  a  high  level,  and  that  automate 
the  details  like  synchronization  and  code  generation.  This  thesis  explores  algorithms  and 
techniques  to  develop  such  tools. 

1.2  Contributions  of  this  Thesis 

One  major  theme  of  this  thesis  is  an  analysis  of  the  effect  of  resource  contention  in 
multiprocessor systems. We develop methods to analyze the effects of contention, architectures
that are optimized to deal with these effects, and synthesis techniques and
algorithms  tailored  to  these  architectures. 

We  consider  a  variety  of  systems  with  different  cost/performance  tradeoffs.  Each 
successive  level  of  hardware  complexity  reduces  the  effects  of  communication  cost  and 
resource contention, allows higher performance, and presents unique optimization challenges
for the designer. We present techniques to deal with each of these challenges.

We  begin  with  a  shared  electrical  bus  system,  which  is  the  simplest,  lowest  cost 
solution. The effects of contention are the most pronounced in these systems, and performance
analysis is also the most complicated. We present a technique that makes
analysis more efficient in these systems.

In order to reduce synchronization costs and improve predictability in these systems,
researchers have previously developed an ordered transaction strategy that adds
a hardware controller to the shared bus system [?] and have analyzed the effects of
communication costs in these systems [?]. In this thesis we present a modification
of this idea that utilizes optical fiber interconnects. This has the effect of dramatically
reducing communication resource contention in the system.

The final, most complex architecture we consider is a multiprocessor system utilizing
free space interconnects. This can eliminate communication resource contention
entirely.  One  unique  challenge  for  this  system  is  to  determine  an  optimal  partitioning 
of  the  chip  area  between  regions  that  are  connected  electrically  and  regions  that  are 
connected  optically. 

The optically connected systems offer the ability to tailor the interconnection
network  optimally  for  a  specific  application.  This  opens  up  a  vast  new  design  space 
and  poses  several  interesting  challenges  in  scheduling  and  interconnect  synthesis.  We 
present  new  scheduling,  interconnect  synthesis,  and  optimization  techniques  to  address 
these  challenges. 

1.2.1  Contention  Analysis  in  Shared  Bus  Systems 

A critical challenge in synthesis techniques for iterative applications is the efficient analysis
of performance in the presence of communication resource contention. To address
this challenge for shared bus systems we introduce in Chapter [?] the concept of the period
graph. The period graph is constructed from the output of a simulation of the system,
with idle states included in the graph, and its maximum cycle mean is used to estimate
overall system throughput. We analyze the fidelity of this estimator. As an example of
the utility of the period graph, we demonstrate its use in a joint power/performance voltage
scaling optimization solution. We quantify the speedup and optimization accuracy
obtained using the period graph compared to using simulation only.
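
As a rough illustration of the quantity involved, the following sketch computes the maximum cycle mean of a small weighted directed graph using Karp's algorithm. It is not the period graph construction of this thesis: the function name `max_cycle_mean`, the integer edge weights, the example graph, and the choice of Karp's algorithm are all assumptions made for illustration. The cycle mean is taken here in the classical sense (total cycle weight divided by the number of edges on the cycle), which may differ from how cycles are weighted in the period graph.

```python
from fractions import Fraction


def max_cycle_mean(n, edges):
    """Maximum cycle mean of a directed graph with nodes 0..n-1.

    `edges` is a list of (u, v, weight) triples with integer weights.
    Returns a Fraction, or None if the graph has no cycle.
    """
    # best[k][v] = maximum weight of a walk with exactly k edges that ends
    # at v, starting anywhere (None if no such walk exists).
    best = [[None] * n for _ in range(n + 1)]
    best[0] = [0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if best[k - 1][u] is not None:
                candidate = best[k - 1][u] + w
                if best[k][v] is None or candidate > best[k][v]:
                    best[k][v] = candidate

    result = None
    for v in range(n):
        if best[n][v] is None:
            continue  # no walk of n edges ends here
        # Karp's formula, written for the maximum instead of the minimum.
        worst = min(
            Fraction(best[n][v] - best[k][v], n - k)
            for k in range(n)
            if best[k][v] is not None
        )
        if result is None or worst > result:
            result = worst
    return result


# Hypothetical example: two cycles, 0 -> 1 -> 0 (mean 3) and 1 -> 2 -> 1 (mean 5).
example_edges = [(0, 1, 2), (1, 0, 4), (1, 2, 1), (2, 1, 9)]
mcm = max_cycle_mean(3, example_edges)  # Fraction(5, 1)
```

Exact rational arithmetic is used so that comparisons between cycle means are not affected by floating-point rounding.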

1.2.2  Architectures  Designed  for  Optically  Connected  Systems 

In Chapter 3 we will discuss the role that optical interconnects can play in embedded
multiprocessor  systems,  and  derive  some  fundamental  equations  relating  to  optically 
connected  systems  on  chip.  We  will  introduce  three  architectures  on  which  a  broad 
class  of  high-throughput,  self-timed  DSP  applications  can  be  analyzed  accurately  using 
efficient  graph-theoretic  algorithms. 

1.2.3  Contention  Analysis  in  Optically  Connected  Systems 

Shared bus systems are appealing due to their simplicity and low cost. This is the primary
driver for many embedded systems applications. However, a shared bus sometimes
cannot meet the performance requirements for systems with significant interprocessor
communication. In these cases, a designer may consider using a more expensive optical
interconnect. In Chapter 5 we will explain how we modified the IPC graph model [?]
and the synchronization graph model [?] to work with the optical architectures developed
in Chapter 3.

1.2.4  Scheduling  for  Arbitrarily  Connected  Systems 

Optics  provide  the  ability  to  construct  highly  connected  and  irregular  networks  that  are 
streamlined  for  particular  applications.  Using  these  networks,  it  is  possible  to  implement 
application  mappings  that  allow  flexible,  single-hop  communication  patterns  between 
processors, which has advantages for reduced system latency and power. This flexibility
is particularly promising for embedded DSP applications, which are highly parallel
and typically have tight constraints on latency and power consumption. In Chapter 5
we discuss the development of scheduling methods for optically connected embedded
multiprocessors. We demonstrate that existing scheduling techniques will deadlock if
communication is constrained by the number of hops. We detail an efficient algorithm for
avoiding this deadlock, and demonstrate its performance on several benchmark examples.

1.2.5  Synthesizing  an  Optimal  Interconnection  Network 

The  freedom  to  optimize  interconnection  patterns  opens  up  a  vast  design  space,  and 
thus  the  design  of  an  optimal  interconnect  structure  for  a  given  application  or  set  of 
applications is a significant challenge. In Chapter 7, we illustrate both probabilistic and
deterministic interconnection synthesis algorithms. A key distinguishing feature of our
interconnect synthesis algorithms is that they work in conjunction with a scheduling
strategy, whereas most existing interconnect synthesis algorithms assume a given schedule.

1.2.6  Simulated  Heating 

All of the optimization problems we have considered, such as dynamic voltage scaling,
scheduling, and interconnect synthesis, involve the search of vast design spaces.
Most DSP optimization problems that arise in hardware-software co-design also involve
searching large design spaces. For many of these problems, efficient local search algorithms
exist for refining arbitrary points in the design space into better solutions. In
Chapter 8 we introduce a novel approach, called simulated heating, for systematically
integrating parameterized local search into global search algorithms. Using the framework
of simulated heating, in Chapter [?] we investigate both static and dynamic strategies
for systematically managing the trade-off between local search accuracy and optimization
effort for the voltage scaling application mentioned earlier, as well as a memory
cost minimization problem and a more widely known optimization problem (binary
knapsack). We also explain how simulated heating can be used in the transaction ordering
optimization problem and the interconnect synthesis optimization problem. The
application of simulated heating to these last two problems is the subject of future work.

Chapter 2


Electronic  Design  Automation  for  Embedded 
Systems 


As  mentioned  earlier,  the  trend  toward  increasingly  complex  designs  makes  automated 
design  tools  very  attractive.  Ultimately,  we  would  like  a  single  tool  that  could  start 
with an abstract, system-level design description and produce details of an optimized
hardware implementation. To reach this goal, we must have a suitable framework for
describing  the  system  at  a  high  level  of  abstraction.  Automated  tools  should  be  able  to 
use  this  high  level  specification  to  generate  the  details  of  the  design.  This  chapter  will 
discuss  the  dataflow  specification,  and  how  it  can  be  used  for  high  level  design. 

2.1  Dataflow 

Dataflow graphs have proven to be a very useful specification for signal processing systems
for several reasons. First, they support block-diagram based visual programming.
Block diagrams (also called signal flow graphs or flow charts) are a versatile and important
method for expressing DSP designs. Some of the most powerful DSP design
tools use block diagrams as their primary design language. In these tools, the user describes
a signal processing system by assembling a block diagram from a library of
block functions, such as various types of filters. Examples of commercially available
tools using dataflow and visual programming are the Signal Processing Worksystem
from Cadence [?] and System Canvas from Angeles Design Systems [?].

A second strength of the dataflow specification is that it effectively exposes the parallelism
in the application. It is difficult to compile programs written in imperative
programming languages such as C for parallel architectures, since these languages are
known to over-specify the control specification and the streaming specification. Parallel
languages such as Universal Parallel C [?] are extensions of the serial languages
intended to be compiled on parallel machines. However, these languages make certain
assumptions about the hardware and are not applicable to a general architecture. They
also require the programmer to explicitly handle lower-level details that we would like
to avoid. The dataflow model imposes minimal data-dependency constraints in its specification,
which allows the compiler to effectively detect parallelism.

A  third  advantage  of  the  dataflow  model  is  that  in  certain  restricted  forms  it  enables 
efficient  algorithms  for  determining  whether  a  program  will  deadlock,  and  whether  it 
can  be  implemented  in  a  finite  amount  of  memory.  This  is  not  possible  in  more  general 
computational  models,  as  will  be  discussed  later. 

We will focus on applications that can be described by synchronous dataflow (SDF)
graphs [?] and their various extensions such as boolean dataflow (BDF) [?]. In the SDF
model, streams of data flow through a network of computational nodes. A program
is represented as a directed dataflow graph. The vertices of this graph, called actors,
represent computations and the edges represent FIFO buffers that queue the data. The
data, represented by tokens, are passed from the output of one computation to the input
of another. The number of tokens produced and consumed by each actor is fixed. The
programmer specifies the function performed at each node. The only constraints that are
placed on order of evaluation come from the data dependences in the graph.

Delays  on  SDF  edges  represent  initial  tokens,  and  specify  dependencies  between 
iterations  of  the  actors  in  iterative  execution.  For  example,  if  tokens  produced  by  the 
kth invocation of actor A are consumed by the (k + 2)th invocation of actor B, then the
edge  (A,  B)  contains  two  delays. 

Actors  can  be  of  arbitrary  complexity.  In  DSP  design  environments,  they  typically 
range in complexity from basic operations such as addition or subtraction to signal processing
subsystems such as FFT units and adaptive filters.

We  refer  to  an  SDF  representation  of  an  application  as  an  application  graph.  In  this 
thesis  we  will  mostly  concentrate  on  a  form  of  SDF  called  homogeneous  SDF  (HSDF) 
that is suitable for dataflow-based multiprocessor design tools since it exposes parallelism
more thoroughly. In HSDF, each actor transfers a single token to/from each
incident edge. General techniques for converting SDF graphs into HSDF form are developed
in [?]. We represent a dataflow graph by an ordered pair (V, E), where V is
the set of actors and E is the set of edges. We refer to the source and sink actors of a
dataflow edge e by src(e) and snk(e), we denote the delay on e by delay(e), and we
can represent an edge e by the ordered pair (src(e), snk(e)). We say that e is an output
edge of src(e); e is an input edge of snk(e); and e is delay-less if delay(e) = 0. The
execution time or estimated execution time of an actor v is denoted t(v).
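
The notation above maps naturally onto a small data structure. The following is a minimal sketch, assuming actors are identified by strings and execution times are integers; the class names `Edge` and `HSDFGraph` and the two-actor example are hypothetical and are not taken from the tools described in this thesis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    src: str        # src(e): actor that produces tokens onto the edge
    snk: str        # snk(e): actor that consumes tokens from the edge
    delay: int = 0  # delay(e): number of initial tokens on the edge


@dataclass
class HSDFGraph:
    exec_time: dict                            # t(v) for every actor v in V
    edges: list = field(default_factory=list)  # the edge set E

    def output_edges(self, v):
        """Edges e with src(e) = v."""
        return [e for e in self.edges if e.src == v]

    def input_edges(self, v):
        """Edges e with snk(e) = v."""
        return [e for e in self.edges if e.snk == v]

    def delayless_edges(self):
        """Edges e with delay(e) = 0."""
        return [e for e in self.edges if e.delay == 0]


# Hypothetical two-actor graph: tokens produced by the kth invocation of A
# are consumed by the (k + 2)th invocation of B, so edge (A, B) has two delays.
graph = HSDFGraph(exec_time={"A": 2, "B": 3})
graph.edges.append(Edge("A", "B", delay=2))
graph.edges.append(Edge("B", "A"))
```

Keeping the delay on the edge object mirrors the convention that delays are initial tokens on the corresponding FIFO buffer.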

Fundamental work related to the dataflow model was the work on computation
graphs by Karp and Miller [?]. In this model, the computation is represented as a
directed graph where nodes represent operations and edges represent queues of data.
Karp and Miller proved that computation graphs with certain properties are determinate,
which means that the sequence of data values produced by each node does not depend
on the schedule, or order of execution of the actors. They gave conditions to determine
graphs whose computation can proceed indefinitely (avoidance of deadlock).

Figure 2.1: Marked Petri net.

Several forms of dataflow are special cases of Petri nets. A general form of Petri nets
is discussed in [?]. A Petri net is a directed graph $G = (V, A)$, where $V = \{v_1, \ldots, v_s\}$
is the set of vertices and $A = \{a_1, \ldots, a_r\}$ is a bag¹ of arcs. The set $V$ can be partitioned
into two disjoint sets: $P$, representing places, and $T$, representing transitions. Every arc
in a Petri net connects a place to a transition or a transition to a place. Places may
contain some number of tokens. A marking of a Petri net is a sequence of nonnegative
integers, one for each place in the net, representing the number of tokens in the place. A
Petri net together with a marking is called a marked Petri net. An example is given
in Figure 2.1. A Petri net executes by firing transitions. When a transition fires,
one token is removed from each input place of the transition and one token is added
to each output place of the transition. A transition that has enough tokens on its input
places to fire is enabled. Enabled transitions may fire, but are not required to. Firing
may occur in any order and may continue as long as at least one transition is enabled.
In Figure 2.1, transitions $t_1$, $t_2$, and $t_4$ are enabled. The marking can be represented as
a vector $\{1, 1, 2, 0\}$. If transition $t_4$ is fired, the new marking will be $\{1, 1, 1, 1\}$ and
transition $t_3$ will be enabled.

¹ A bag is distinguished from a set in that a given element can be included n times in a bag, so that the
membership function is integer-valued rather than boolean-valued.
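
The firing rule described in this section can be sketched in a few lines. The net below is hypothetical: it is one small net chosen so that its behavior agrees with the markings quoted in the text (marking 1, 1, 2, 0, with t1, t2, and t4 enabled), and it is not claimed to be the net drawn in Figure 2.1. The function names and the dictionary-based representation are likewise assumptions.

```python
from collections import Counter


def enabled_transitions(marking, inputs):
    """Return the transitions with enough tokens on all of their input places.

    `inputs[t]` lists the input places of transition t. A place listed twice
    models two arcs from that place, since arcs form a bag rather than a set.
    """
    result = []
    for t, places in inputs.items():
        needed = Counter(places)
        if all(marking[p] >= count for p, count in needed.items()):
            result.append(t)
    return sorted(result)


def fire(marking, inputs, outputs, t):
    """Fire transition t and return the new marking (input is not modified)."""
    if t not in enabled_transitions(marking, inputs):
        raise ValueError(f"transition {t} is not enabled")
    new_marking = dict(marking)
    for p in inputs[t]:
        new_marking[p] -= 1  # one token removed per input arc
    for p in outputs[t]:
        new_marking[p] += 1  # one token added per output arc
    return new_marking


# Hypothetical net with four places and four transitions.
inputs = {"t1": ["p1"], "t2": ["p2"], "t3": ["p4"], "t4": ["p3"]}
outputs = {"t1": ["p2"], "t2": ["p1"], "t3": ["p3"], "t4": ["p4"]}
m0 = {"p1": 1, "p2": 1, "p3": 2, "p4": 0}

before = enabled_transitions(m0, inputs)  # ['t1', 't2', 't4']
m1 = fire(m0, inputs, outputs, "t4")      # p3 loses a token, p4 gains one
after = enabled_transitions(m1, inputs)   # t3 is now enabled as well
```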

Marked  graphs  are  a  subclass  of  Petri  nets.  A  marked  graph  is  a  Petri  net  in  which 
every  place  has  exactly  one  input  transition  and  one  output  transition.  A  marked  graph 
can be represented by a graph with only a single type of node corresponding to transitions,
with the data tokens considered to exist on the arcs. This representation is standard
in dataflow. The properties of marked graphs were first investigated in [?].
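
The conversion from a marked graph to the single-node-type representation can be sketched as follows. This is an illustrative sketch under assumed data structures: each place with exactly one input transition and one output transition becomes one directed edge, and the tokens in the place become the tokens on that edge. The function name `marked_graph_to_edges` and the two-transition example are hypothetical.

```python
def marked_graph_to_edges(inputs, outputs, marking):
    """Collapse each place of a marked graph into a single edge.

    `inputs[t]` / `outputs[t]` list the input / output places of transition t.
    Returns (producer, consumer, tokens) triples, one per place, and raises
    ValueError if some place does not have exactly one producer and consumer.
    """
    producers, consumers = {}, {}
    for t, places in outputs.items():
        for p in places:
            producers.setdefault(p, []).append(t)
    for t, places in inputs.items():
        for p in places:
            consumers.setdefault(p, []).append(t)

    edges = []
    for place, tokens in marking.items():
        ins = producers.get(place, [])
        outs = consumers.get(place, [])
        if len(ins) != 1 or len(outs) != 1:
            raise ValueError(f"place {place} violates the marked graph condition")
        edges.append((ins[0], outs[0], tokens))
    return edges


# Hypothetical marked graph: A -> p1 -> B and B -> p2 -> A, two tokens in p2.
mg_inputs = {"A": ["p2"], "B": ["p1"]}
mg_outputs = {"A": ["p1"], "B": ["p2"]}
mg_marking = {"p1": 0, "p2": 2}
mg_edges = marked_graph_to_edges(mg_inputs, mg_outputs, mg_marking)
```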

The application of dataflow to computer architectures and programming languages
was pioneered by Dennis [?]. The dataflow model of computer architecture was designed
to enforce the ordering of instruction execution according to data dependencies.
Execution of instructions is driven by the availability of data, as opposed to the more
conventional von Neumann computer where the execution of instructions is controlled
by a program counter. In a static dataflow machine, dataflow graphs are executed directly
by maintaining at the machine level a representation of the program as a dataflow
graph and by providing hardware capabilities to detect when an actor has sufficient data
to fire. There is a restriction that at most one data value can be queued on an edge at
one time. This enables the storage for edges to be determined at compile time. However,
this restriction also limits the amount of parallelism that can be extracted from
loops. The tagged-token dataflow model [?] was created to overcome this restriction.
This model supports the execution of loop iterations and function invocations in
parallel. Data values are carried by tokens that include a three-part tag. The first part of
the tag marks the current procedure invocation, the second part of the tag marks the loop
iteration number, and the third part of the tag identifies the instruction number. Dataflow
computers successfully address the problems of synchronization and memory latency,
but are not as successful in coping with the resource requirements of large amounts of
parallelism in the code. This is due to the overhead in keeping track of the data tags.
Although some research continues on dataflow computers, none are in commercial development
today. Most research into dataflow today applies to program representation.

Synchronous dataflow (SDF) is a restricted version of dataflow in which the number
of tokens produced and consumed by an actor on each edge is fixed and known at
compile time. Application of the SDF model to programming of multirate DSP systems
was originated by Lee and Messerschmitt [?]. Lee and Messerschmitt provided
efficient techniques to determine at compile time whether or not an arbitrary SDF graph
has a periodic schedule that neither deadlocks nor requires unbounded buffer sizes. They
also presented efficient methods for constructing such a periodic schedule whenever one
exists. The SDF model has been successful at describing a large class of DSP applications
and has been utilized in numerous design environments. Techniques for compiling
general SDF programs for multirate DSP systems into efficient uniprocessor implementations
that minimize both code and data memory requirements are presented in [?].
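
One ingredient of this kind of compile-time analysis is solving the balance equations of the graph: on every edge, the number of tokens produced in one period must equal the number consumed. The sketch below computes the smallest positive integer solution (commonly called the repetitions vector) for a connected SDF graph, or reports that none exists. It checks sample-rate consistency only and does not test for deadlock. The function name, the edge tuple layout (source, sink, tokens produced, tokens consumed), and the example graphs are assumptions made for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(actors, edges):
    """Smallest positive integer solution of the SDF balance equations.

    For each edge (src, snk, produced, consumed) the solution q satisfies
    q[src] * produced == q[snk] * consumed. Returns None if the rates are
    inconsistent. The graph is assumed to be connected.
    """
    neighbours = {a: [] for a in actors}
    for src, snk, produced, consumed in edges:
        neighbours[src].append((snk, Fraction(produced, consumed)))
        neighbours[snk].append((src, Fraction(consumed, produced)))

    # Propagate relative firing rates outward from an arbitrary actor.
    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, ratio in neighbours[a]:
            if b not in rate:
                rate[b] = rate[a] * ratio
                stack.append(b)
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")

    # Every edge must be balanced, including those not used for propagation.
    for src, snk, produced, consumed in edges:
        if rate[src] * produced != rate[snk] * consumed:
            return None

    # Scale the rational rates to the smallest positive integers.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: n // common for a, n in counts.items()}


# Hypothetical multirate chain: A produces 2, B consumes 3; B produces 1, C consumes 2.
chain = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetitions_vector(["A", "B", "C"], chain)  # {'A': 3, 'B': 2, 'C': 1}

# Hypothetical inconsistent graph: the two paths from A to C disagree on rates.
bad = [("A", "B", 1, 1), ("B", "C", 1, 1), ("A", "C", 2, 1)]
q_bad = repetitions_vector(["A", "B", "C"], bad)  # None
```

In the chain example, three firings of A produce six tokens, which two firings of B consume; those two firings of B produce two tokens, which one firing of C consumes.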

2.2  Architectural  Synthesis 

System-level  synthesis  requires  as  a  first  step  the  selection  of  an  architecture.  In  some 
cases, the designer is given a fixed platform, so the number of computing elements (processors,
functional units, etc.) is fixed in advance. More commonly in embedded system
design, there is at least some flexibility to choose the number and types of processing
elements and their interconnection. Even with a fixed platform, there can be choices
between which tasks are performed by dedicated hardware units and which tasks are
performed in software. Modern systems consist of an increasing number of these processing
elements, each of which can be highly complex. The design may be realized on
a single chip (system on chip or SoC), in a multichip design using multi-chip modules
(MCM), or on separate circuit boards.

The system synthesis problem can be described formally by means of a specification
graph [?], which is a graph $G_S = (V_S, E_S)$ consisting of $D$ dependence graphs
$G_i(V_i, E_i)$ for $1 \le i \le D$ and a set of mapping edges $E_M$, where $V_S = \bigcup_{i=1}^{D} V_i$,
$E_S = \bigcup_{i=1}^{D} E_i \cup E_M$, and $E_M = \bigcup_{i=1}^{D-1} E_{M_i}$. Here, $E_{M_i} \subseteq V_i \times V_{i+1}$ for $1 \le i < D$.

The specification graph consists of layers of dependence graphs, each corresponding
to a different level of abstraction. For example, an application graph describes the
algorithm,  an  architecture  graph  describes  the  architecture,  and  a  chip  graph  describes 
the  physical  components  of  the  system.  An  edge  in  the  specification  graph  between  a 
task  and  a  resource  means  that  task  can  be  implemented  by  that  resource. 

This can be better described by considering an example. The example in Figure 2.2
was taken from [105]. Figure 2.2(a) depicts an application graph with four computational
nodes and three communication nodes (shaded). The architecture, depicted in
Figure 2.2(b), consists of a RISC processor and two dedicated hardware modules. The
hardware modules are connected to each other by a point-to-point bus, and to the RISC
processor by a shared bus. The architecture graph corresponding to Figure 2.2(b) is
shown in Figure 2.2(c). The physical implementation consists of two separate chips
shown in Figure 2.2(d) with a corresponding chip graph depicted in Figure 2.2(e). The
specification graph is shown in Figure 2.3.

Figure 2.2: Example of a problem graph, an architecture graph, and a chip graph.

Figure 2.3: Specification graph corresponding to the example of Figure 2.2.

The edges $E_{M_1}$ and $E_{M_2}$ describe all possible mappings. The edges $E_{M_1}$ describe
the possible mappings between the application graph and the architecture graph. We can
see that task $v_4$ can only be executed on $v_{\mathrm{RISC}}$ and task $v_2$ can be executed on
either $v_{\mathrm{RISC}}$ or $v_{\mathrm{HWM2}}$. Communication task $v_7$ can be executed on the shared bus
$v_{\mathrm{SB}}$. It can also be executed on $v_{\mathrm{RISC}}$ if tasks $v_3$ and $v_4$ both execute on
$v_{\mathrm{RISC}}$, or on $v_{\mathrm{HWM1}}$ if $v_3$ and $v_4$ also execute on $v_{\mathrm{HWM1}}$. The edges
$E_{M_2}$ describe the possible mappings between the architecture graph and the chip graph. From
these edges we can see that any of the tasks in the architecture graph (the RISC processor,
shared bus, point-to-point bus, and both hardware modules) can be implemented
inside CHIP1, and that the shared bus $v_{\mathrm{SB}}$ can be handled by CHIP1 or by the off-chip
bus $v_{\mathrm{OCB}}$. The dashed nodes and edges in Figure 2.3 are not allocated in the implementation.
The specification graph allows us to state a formal definition for allocation,
binding, and scheduling.

The activation of a specification graph $G_S(V_S, E_S)$ is a function $a : V_S \cup E_S \mapsto \{0, 1\}$
that assigns to each edge $e \in E_S$ and each node $v \in V_S$ the value 1 (activated) or 0 (not
activated).

An allocation $\alpha$ of a specification graph is the subset of all activated nodes and
edges of the dependence graphs, $\alpha = \alpha_V \cup \alpha_E$ with $\alpha_V = \{v \in V_S \mid a(v) = 1\}$ and
$\alpha_E = \bigcup_{i=1}^{D} \{e \in E_i \mid a(e) = 1\}$. For the example above, the allocation of nodes is

$$\alpha_V = \{v_{\mathrm{RISC}}, v_{\mathrm{HWM1}}, v_{\mathrm{SB}}, v_{\mathrm{CHIP1}}\}.$$

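A binding $\beta$ is the subset of all activated mapping edges, so that
$\beta = \{e \in E_M \mid a(e) = 1\}$. For the example above, the binding is

$$\beta = \{(v_1, v_{\mathrm{RISC}}), (v_2, v_{\mathrm{RISC}}), (v_3, v_{\mathrm{HWM1}}), (v_4, v_{\mathrm{RISC}}), (v_5, v_{\mathrm{SB}}), (v_6, v_{\mathrm{RISC}}),$$
$$(v_7, v_{\mathrm{SB}}), (v_{\mathrm{RISC}}, v_{\mathrm{CHIP1}}), (v_{\mathrm{SB}}, v_{\mathrm{CHIP1}}), (v_{\mathrm{HWM1}}, v_{\mathrm{CHIP1}})\}$$

so that all the architecture components are bound to CHIP1.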

A feasible binding $\beta$ is a binding that satisfies the following criteria:

1. Each activated edge $e \in \beta$ starts and ends at an activated node.

2. For each activated node $v \in \alpha_V$ with $v \in V_i$, $1 \le i < D$, exactly one outgoing
edge $e \in E_M$ is activated.

3. For each activated edge $e = (v_i, v_j) \in \alpha_E$ with $e \in E_i$, $1 \le i < D$, either
both operations are mapped onto the same node or there exists an activated edge
$\tilde{e} = (\tilde{v}_i, \tilde{v}_j) \in \alpha_E$ with $\tilde{e} \in E_{i+1}$ to handle the communication associated with
edge $e$, i.e. $(\tilde{v}_i, \tilde{v}_j) \in \alpha_E$ with $(v_i, \tilde{v}_i), (v_j, \tilde{v}_j) \in \beta$.

It  has  been  shown  that  the  problem  of  finding  a  feasible  binding  is  NP-complete  [IT9iJ . 
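
The three feasibility criteria can be checked mechanically once a candidate allocation and binding are given. The following Python sketch is illustrative only: the function name `is_feasible_binding`, the set-based representation of the layers, and the small two-layer example are assumptions made here and are not taken from the cited work or from the example of Figure 2.2. It verifies a given binding; it does not search for one, which is the hard part of the problem.

```python
def is_feasible_binding(layers, alloc_nodes, alloc_edges, binding):
    """Check the three feasibility criteria for a binding.

    layers      -- list of (nodes, edges) pairs, one per dependence graph G_1..G_D
    alloc_nodes -- set of activated nodes (alpha_V)
    alloc_edges -- set of activated dependence edges (alpha_E)
    binding     -- set of activated mapping edges (beta), as (node, resource) pairs
    """
    # Criterion 1: every activated mapping edge joins two activated nodes.
    if any(u not in alloc_nodes or w not in alloc_nodes for u, w in binding):
        return False

    # Criterion 2: each activated node in G_1..G_{D-1} has exactly one
    # activated outgoing mapping edge.
    target = {}
    for nodes, _ in layers[:-1]:
        for v in nodes & alloc_nodes:
            outs = [w for u, w in binding if u == v]
            if len(outs) != 1:
                return False
            target[v] = outs[0]

    # Criterion 3: an activated edge (u, v) of G_i is either internal to one
    # resource, or carried by an activated edge of G_{i+1}.
    for i, (_, edges) in enumerate(layers[:-1]):
        next_edges = layers[i + 1][1]
        for u, v in edges & alloc_edges:
            if u not in target or v not in target:
                return False
            tu, tv = target[u], target[v]
            carried = (tu, tv) in next_edges and (tu, tv) in alloc_edges
            if tu != tv and not carried:
                return False
    return True


# Hypothetical two-layer example: tasks a -> c -> b (c is a communication task)
# mapped onto an architecture P1 -> BUS -> P2.
app = ({"a", "c", "b"}, {("a", "c"), ("c", "b")})
arch = ({"P1", "BUS", "P2"}, {("P1", "BUS"), ("BUS", "P2")})
all_nodes = app[0] | arch[0]
all_edges = app[1] | arch[1]

good = {("a", "P1"), ("c", "BUS"), ("b", "P2")}
bad = {("a", "P1"), ("c", "P1"), ("b", "P2")}  # no direct P1 -> P2 link for (c, b)
print(is_feasible_binding([app, arch], all_nodes, all_edges, good))
print(is_feasible_binding([app, arch], all_nodes, all_edges, bad))
```

In the second binding the communication task shares a processor with its producer, so the edge to the consumer would need a direct link between the two processors, which this hypothetical architecture lacks.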

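A schedule is a function $\tau : V_P \mapsto \mathbb{Z}^+$ that satisfies for all edges $e = (v_i, v_j) \in E_P$
the condition $\tau(v_j) \ge \tau(v_i) + \mathrm{delay}(v_i, \beta)$, where $\mathrm{delay}(v, \beta)$ is the execution time
delay of node $v$ given a binding $\beta$. Here $V_P$ and $E_P$ denote the nodes and edges of the
problem (application) graph. For the example above a valid schedule is $\tau(v_1) = 0$,
$\tau(v_2) = 1$, $\tau(v_3) = 2$, $\tau(v_4) = 21$, $\tau(v_5) = 1$, $\tau(v_6) = 21$, $\tau(v_7) = 4$.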
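
The precedence condition on a schedule is easy to test directly. The sketch below is a minimal illustration in Python; the function name `is_valid_schedule`, the three-task chain, and its delays are hypothetical and do not correspond to the example of Figure 2.2, whose edge set and delays are not listed in the text.

```python
def is_valid_schedule(tau, edges, delay):
    """True if tau[v_j] >= tau[v_i] + delay[v_i] holds for every edge (v_i, v_j)."""
    return all(tau[vj] >= tau[vi] + delay[vi] for vi, vj in edges)


# Hypothetical chain t1 -> t2 -> t3; each delay is the execution time of the
# task under some fixed binding.
chain = [("t1", "t2"), ("t2", "t3")]
delay = {"t1": 1, "t2": 3, "t3": 2}

print(is_valid_schedule({"t1": 0, "t2": 1, "t3": 4}, chain, delay))  # constraints met
print(is_valid_schedule({"t1": 0, "t2": 1, "t3": 3}, chain, delay))  # t3 starts too early
```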

A valid implementation is a triple $(\alpha, \beta, \tau)$ where $\alpha$ is an allocation, $\beta$ is a binding,
and $\tau$ is a schedule.

Finally, with the definitions above we can state the problem formally: system synthesis
consists of minimizing a function $h(\alpha, \beta, \tau)$ which describes an optimization
goal, subject to

• $\alpha$ is a feasible allocation,

• $\beta$ is a feasible binding,

• $\tau$ is a schedule.

2.3 Scheduling


Implementing an algorithm specified as a dataflow graph (DFG) on a multiprocessor system
requires “scheduling” the actors. Scheduling was defined formally in Section 2.2.
Scheduling involves the tasks of (1) assigning actors in the DFG to processors, (2) ordering
the execution of these actors on each processor, and (3) determining the start times of
all the actors while maintaining the data precedence constraints of the DFG. Scheduling
has been studied extensively in many contexts, and has been classified based on which
of the tasks listed above are performed at compile time and which at run time [ifSSIJ.

If all three are performed at compile time, the scheduling strategy is said to be fully
static.  This  method  requires  the  least  possible  runtime  overhead.  The  exact  execution 
times  of  all  the  actors  are  assumed  to  be  given  in  advance.  The  processors  can  run  in 
lock  step  according  to  the  schedule,  and  no  explicit  synchronization  is  required  when 
they  communicate  data.  However,  the  exact  run  times  of  the  actors  cannot  usually  be 
determined  in  advance,  so  the  fully  static  strategy  is  often  not  practical. 

For  DSP  applications,  it  is  usually  realistic  to  assume  that  good  estimates  for  the 
execution  times  can  be  determined.  Given  this  assumption,  a  self-timed  [EEQ  scheduling 
strategy  can  be  employed,  where  the  ordering  of  the  actors  on  each  processor  is  speci¬ 
fied,  but  not  the  exact  start  times.  Each  processor  waits  for  the  data  needed  by  an  actor 
before  executing  that  actor.  This  requires  that  the  processors  perform  some  run-time 
synchronization  when  they  exchange  data,  so  the  run-time  overhead  is  greater  for  this 
scheduling strategy. Examples of an application graph and a corresponding self-timed
schedule are illustrated in Figure 2.4.

Another consideration in scheduling is the size or granularity of an actor. Figure 2.5
shows  a  trade-off  between  parallelism  and  communication  overhead  in  a  heterogeneous 
DSP  system  as  the  size  of  the  actor  is  varied.  It  is  repeated  from  the  study  by  Sarkar  [1931] . 
Figure 2.4: An example of an application graph and an associated self-timed schedule.
The  numbers  on  edges  (4,8)  and  (4,9)  denote  nonzero  delays. 


Figure  2.5:  Partition-overhead  trade-off  022]. 
The vertical axis is a measure of performance. As the average actor size is increased, the
interprocessor  communication  (IPC)  overhead  drops.  At  the  same  time,  there  is  a  loss  of 
parallelism,  so  the  execution  time  for  an  ideal  parallel  system  (with  no  IPC)  increases. 
Partitioning  algorithms  try  to  find  the  optimal  balance  between  these  two  factors.  Sarkar 
developed  a  two-phase  scheduling  method.  The  first  phase  involved  scheduling  the  input 
graph  onto  an  ideal  architecture  in  which  there  are  no  resource  constraints  or  communi¬ 
cation  costs.  This  infinite-resource  multiprocessor  architecture  (IRMA)  consists  of 
an  infinite  number  of  processors  that  are  interconnected  by  a  fully-connected  crossbar 
interconnect  (an  interconnect  in  which  every  processor  is  directly  connected  to  every 
other  processor).  The  communication  in  the  IRMA  architecture  is  assumed  to  be  si¬ 
multaneous.  In  the  second  phase,  the  schedule  derived  for  the  IRMA  architecture  is 
modified  to  work  on  the  resource-constrained  target  architecture. 

For a system with fixed resource constraints, the multiprocessor partitioning and
scheduling problems are NP-hard [S3], so heuristics must be used. Many such heuristics
have been developed. Most existing scheduling heuristics try to minimize the schedule
makespan, which is the time it takes for all the tasks to finish the first iteration (execution
of one schedule period). However, most DSP applications are non-terminating; an
example is a filter operating on an unbounded stream of speech samples. In this case,
it is more appropriate to generate schedules that maximize the throughput. Scheduling
heuristics can be classified into the following categories: list scheduling heuristics,
graph decomposition heuristics, and critical path heuristics.

The  most  well-studied  area  in  scheduling  involves  heuristics  based  on  the  idea  of 
priority  lists  [BH].  These  heuristics  use  a  priority  list  to  define  an  ordering  of  the  nodes 
in  the  graph,  and  use  an  algorithm  that  selects  each  function  in  order  of  priority  for 
scheduling on an appropriate resource. In order to compute the priorities, the allocation
and binding steps described in Section 2.2 need to be performed in advance.
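
A priority-list heuristic of this kind can be sketched in a few lines of Python. The version below is a simplified illustration rather than any particular published algorithm: it assumes identical processors, zero communication cost, an acyclic precedence graph, and priorities supplied by the caller (here, the longest path from each task to a sink). The function name `list_schedule` and the four-task example are hypothetical.

```python
def list_schedule(exec_time, preds, priority, num_procs):
    """Greedy list scheduling on identical processors (no communication cost).

    Repeatedly picks the ready task with the highest priority and places it on
    the processor where it can start earliest. Returns (start, proc_of).
    """
    proc_free = [0] * num_procs          # time at which each processor is idle
    start, finish, proc_of = {}, {}, {}
    remaining = set(exec_time)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [v for v in sorted(remaining)
                 if all(p in finish for p in preds.get(v, ()))]
        v = max(ready, key=lambda x: priority[x])
        data_ready = max((finish[p] for p in preds.get(v, ())), default=0)
        p = min(range(num_procs), key=lambda q: max(proc_free[q], data_ready))
        start[v] = max(proc_free[p], data_ready)
        finish[v] = start[v] + exec_time[v]
        proc_free[p] = finish[v]
        proc_of[v] = p
        remaining.remove(v)
    return start, proc_of


# Hypothetical task graph: A -> C, A -> D, B -> D.
exec_time = {"A": 2, "B": 3, "C": 1, "D": 2}
preds = {"C": ["A"], "D": ["A", "B"]}
# Priority = execution time plus the longest path to a sink.
priority = {"A": 4, "B": 5, "C": 1, "D": 2}

start, proc_of = list_schedule(exec_time, preds, priority, num_procs=2)
makespan = max(start[v] + exec_time[v] for v in exec_time)
print(start, proc_of, makespan)
```

On this example the two source tasks are placed on different processors and the makespan equals the length of the longest path, B followed by D.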

For  DFGs  with  edge  weights  and  node  weights,  a  path  weight  can  be  defined  as  the 
sum  of  the  weights  of  both  nodes  and  edges  on  the  path.  A  critical  path  from  a  source 
node  to  a  sink  node  is  a  path  with  maximal  weight.  In  the  critical  path  techniques,  the 
graph  is  partitioned  after  examining  the  current  critical  path,  zeroing  an  edge  by  com¬ 
bining  the  incident  nodes  into  a  cluster,  and  repeating  the  process  on  the  new  critical 
path.  In  the  dominant  sequence  clustering  algorithm  by  Yang  and  Gerasoulis  [EH],  the 
decision  to  zero  an  edge  is  based  on  the  new  start  time  of  the  node  at  the  beginning  of 
the  dominant  sequence  (the  critical  path  after  zeroing  of  one  or  more  edges)  and  the  start 
time  of  an  unscheduled  node  most  likely  to  be  affected  by  the  zeroing  decision.  If  either 
of these start times is increased, the zeroing is not done. Due to the relative simplicity
of the zeroing criteria, the complexity of this method is $O((v + e) \log v)$, where $v$ and $e$
are the numbers of nodes and edges in the graph. The modified
critical path algorithm by Wu and Gajski [11 OKI] considers as-late-as-possible binding,
which is found by traversing the graph from the sink nodes to the source nodes and assigning
the latest possible start time to each node. A node on the critical path is selected
and placed on a different processor. The complexity of this method is $O(v^2 \log v)$.
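
The notions of path weight, critical path, and edge zeroing can be made concrete with a short Python sketch. It only computes the critical path of an acyclic graph and shows how zeroing one edge changes it; it is not an implementation of the dominant sequence clustering or modified critical path algorithms. The function name `critical_path` and the example weights are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(node_w, edge_w):
    """Return (weight, path) of a maximum-weight path in a weighted DAG.

    The weight of a path is the sum of its node weights and edge weights.
    """
    preds = {v: set() for v in node_w}
    for u, v in edge_w:
        preds[v].add(u)
    best = {}  # best[v] = (weight of heaviest path ending at v, predecessor)
    for v in TopologicalSorter(preds).static_order():
        cands = [(best[u][0] + edge_w[(u, v)], u) for u in preds[v]]
        weight, prev = max(cands, default=(0, None))
        best[v] = (weight + node_w[v], prev)
    end = max(best, key=lambda v: best[v][0])
    path, v = [], end
    while v is not None:
        path.append(v)
        v = best[v][1]
    return best[end][0], path[::-1]


# Hypothetical graph: node weights are execution times, edge weights are
# communication costs.
node_w = {"a": 2, "b": 3, "c": 1, "d": 2}
edge_w = {("a", "b"): 4, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 5}
print(critical_path(node_w, edge_w))

# Zeroing edge (a, b), i.e. clustering a and b, shifts the critical path.
zeroed = {**edge_w, ("a", "b"): 0}
print(critical_path(node_w, zeroed))
```

After the edge between a and b is zeroed, the path through c becomes the heaviest one, which is why the clustering techniques re-examine the critical path after every zeroing step.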

2.4  Modeling  Self-Timed  Execution 

In  relation  to  the  scheduling  taxonomy  of  Lee  and  Ha  [ESI],  in  this  thesis  we  focus  on  the 
self-timed  strategy  and  variations  of  the  closely-related  ordered  transactions  strategy 
optimized for optically-connected multiprocessors. The self-timed and ordered transaction
strategies are popular and efficient for the DSP domain due to their combination
of robustness, predictability, and flexibility [101]. In self-timed scheduling, each
processor executes the tasks assigned to it in a fixed order that is specified at compile
time. Before executing an actor, a processor waits for the data needed by that actor
to  become  available.  Thus,  processors  are  required  to  perform  run-time  synchroniza¬ 
tion  when  they  communicate  data.  This  provides  robustness  when  the  execution  times 
of  tasks  are  not  known  precisely  or  when  they  may  exhibit  occasional  deviations  from 
their  compile-time  estimates.  Examples  of  an  application  graph  and  a  corresponding 
self-timed  schedule  are  shown  in  Figure  [2.4|. 

The  ordered  transaction  method  is  similar  to  the  self-timed  method,  but  it  also  adds 
the constraint that a linear ordering of the communication actors is determined at compile
time, and enforced at run-time [102]. The linear ordering imposed is called the
transaction  order  of  the  associated  multiprocessor  implementation.  The  transaction  or¬ 
der,  which  is  enforced  by  special  hardware,  obviates  run-time  synchronization  and  bus 
arbitration,  and  also  enhances  predictability.  Also,  if  constructed  carefully,  it  can  in  gen¬ 
eral  lead  to  a  more  efficient  pattern  of  actor/communication  operations  compared  to  an 
equivalent  self-timed  implementation  [62]. 

Next we will examine two related graph-theoretic models, the interprocessor communication
graph (IPC graph) $G_{ipc}$ [101, 102] and the synchronization graph $G_s$ [102],
that are used to model the self-timed execution of a given parallel schedule for a dataflow
graph. Given a self-timed multiprocessor schedule for $G$, we derive $G_{ipc}$ by instantiating
a vertex for each task, connecting an edge from each task to the task that succeeds it
on the same processor, and adding an edge that has unit delay from the last task on each
processor to the first task on the same processor. Also, for each edge $(x, y)$ in $G$ that
connects tasks that execute on different processors, an IPC edge is instantiated in $G_{ipc}$
from $x$ to $y$. Figure 2.6 shows the IPC graph that corresponds to the application graph
and self-timed schedule of Figure 2.4. In this graph, the nodes labeled with “s” are nodes that
send data and the nodes labeled with “r” are nodes that receive data. The numbers in
parentheses represent the sending and receiving actors. For example, s(5, 2) represents
a communication actor with data from actor 5 being sent to actor 2.

Figure 2.6: The IPC graph constructed from the application graph and schedule of Figure 2.4.
Dashed edges represent IPC edges and shaded actors are communication actors
(send and receive actors) that perform interprocessor communication. Numbers next
to edges represent delays.
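
The construction of the IPC graph from a self-timed schedule can be sketched as follows. This Python fragment is a simplified illustration: it omits the explicit send and receive actors, it assumes that an IPC edge inherits the delay of the application edge it comes from, and the function name `build_ipc_graph` and the four-task example are hypothetical rather than taken from Figure 2.4.

```python
def build_ipc_graph(app_edges, proc_order):
    """Build the edge list of an IPC graph.

    app_edges  -- dict mapping an application edge (x, y) to its delay
    proc_order -- list with one entry per processor: the tasks assigned to it,
                  in execution order
    Returns a list of (src, dst, delay, kind) tuples.
    """
    proc_of = {t: p for p, order in enumerate(proc_order) for t in order}
    ipc = []
    for order in proc_order:
        # Each task is followed by its successor on the same processor.
        for a, b in zip(order, order[1:]):
            ipc.append((a, b, 0, "intra"))
        # Unit-delay edge from the last task back to the first one.
        ipc.append((order[-1], order[0], 1, "intra"))
    for (x, y), d in app_edges.items():
        # Application edges that cross processors become IPC edges.
        if proc_of[x] != proc_of[y]:
            ipc.append((x, y, d, "ipc"))
    return ipc


# Hypothetical application graph and two-processor schedule.
app_edges = {("A", "B"): 0, ("B", "C"): 0, ("A", "D"): 0, ("D", "C"): 1}
proc_order = [["A", "B", "C"], ["D"]]
for edge in build_ipc_graph(app_edges, proc_order):
    print(edge)
```

A processor that runs a single task receives a unit-delay self-loop, which expresses that successive invocations of that task cannot overlap.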


The non-communication vertices in $G_s$ and $G_{ipc}$ correspond to individual tasks of
the application being implemented. Each edge in $G_{ipc}$ and $G_s$ is either an intraprocessor
edge or an interprocessor edge. Intraprocessor edges model the ordering (specified
by the given parallel schedule) of tasks assigned to the same processor. Interprocessor
edges in $G_{ipc}$, called IPC edges, connect tasks assigned to distinct processors that must
communicate for the purposes of data transfer, and interprocessor edges in $G_s$, called
synchronization edges, connect tasks assigned to distinct processors that must communicate
for synchronization purposes. We will discuss the synchronization graph in more
detail in Chapter 5.

Each edge $(v_j, v_i)$ in $G_{ipc}$ represents the synchronization constraint

$$\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k - \mathrm{delay}((v_j, v_i))) \quad \forall k, \qquad (2.1)$$

where $\mathrm{start}(v, k)$ and $\mathrm{end}(v, k)$ respectively represent the time at which invocation $k$
of actor $v$ begins execution and completes execution, and $\mathrm{delay}(e)$ represents the delay
associated with edge $e$.
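
The synchronization constraint can be turned into a small simulation of as-soon-as-possible self-timed execution: each invocation starts as soon as every constraint on its incoming edges is satisfied. The Python sketch below is illustrative only. It assumes fixed execution times, unlimited bus bandwidth, and that invocations with negative index impose no constraint; the function name `self_timed_starts` and the example graph are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def self_timed_starts(exec_time, edges, iterations):
    """ASAP start times under start(v, k) >= end(u, k - delay) for each edge.

    edges -- list of (u, v, delay) tuples
    Returns a dict mapping (actor, k) to the start time of invocation k.
    """
    # Within one iteration, actors are visited in an order compatible with
    # the zero-delay edges.
    zero_preds = {v: set() for v in exec_time}
    for u, v, d in edges:
        if d == 0:
            zero_preds[v].add(u)
    order = list(TopologicalSorter(zero_preds).static_order())

    start, end = {}, {}
    for k in range(iterations):
        for v in order:
            ready = [end[(u, k - d)] for u, w, d in edges if w == v and k - d >= 0]
            start[(v, k)] = max(ready, default=0)
            end[(v, k)] = start[(v, k)] + exec_time[v]
    return start


# Hypothetical IPC graph: processor 1 runs A, B, C; processor 2 runs D.
exec_time = {"A": 2, "B": 3, "C": 1, "D": 4}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1),
         ("D", "D", 1), ("A", "D", 0), ("D", "C", 1)]
start = self_timed_starts(exec_time, edges, iterations=4)
print([start[("A", k)] for k in range(4)])
```

In this example successive invocations of A are separated by a constant interval, which is the iteration period of the self-timed execution.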

The IPC graph is an instance of Reiter’s computation graph model [SDJ, also known
as the timed marked graph model in Petri net theory [E5I], and from the theory of such
graphs, it is well known that in the ideal case of unlimited bus bandwidth, the average
iteration period for the as-soon-as-possible (ASAP) execution of an IPC graph is given by the
maximum cycle mean (MCM) of $G_{ipc}$, which is defined by

$$\mathrm{MCM}(G_{ipc}) = \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)} \right\},$$

where $t(v)$ denotes the execution time of actor $v$ and $\mathrm{Delay}(C)$ denotes the sum of the
delays on the edges of cycle $C$.

The MCM can be computed efficiently: Karp’s algorithm [5EJ runs in $\Theta(nm)$ time,
where $n$ is the number of actors in the graph and $m$ is the number of edges. Dasdan and
Gupta describe an algorithm based on Karp’s algorithm that runs in $O(nm)$ time in the
worst case and is always faster than Karp’s algorithm.
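
For small graphs the MCM can also be estimated with a simple parametric search, sketched below in Python. This is not Karp's algorithm or the Dasdan and Gupta variant; it is a binary search over a candidate value combined with Bellman-Ford style positive-cycle detection, chosen here because it is short and handles zero-delay edges directly. The sketch takes the MCM to be the maximum, over all cycles, of the total execution time on the cycle divided by the total delay on the cycle, and assumes every cycle carries at least one delay. The function names and the example graph are hypothetical.

```python
def has_positive_cycle(nodes, edges, exec_time, lam):
    """True if some cycle has sum(exec_time) - lam * sum(delay) > 0."""
    dist = {v: 0.0 for v in nodes}       # longest-path estimates
    for _ in range(len(nodes)):
        changed = False
        for u, v, d in edges:
            cand = dist[u] + exec_time[u] - lam * d
            if cand > dist[v] + 1e-9:
                dist[v] = cand
                changed = True
        if not changed:
            return False                 # estimates converged: no positive cycle
    return True                          # still changing after n passes


def max_cycle_mean(nodes, edges, exec_time, iterations=60):
    """Approximate the MCM by binary search on the candidate value lam."""
    hi = float(sum(exec_time.values()))  # upper bound when every cycle has delay >= 1
    if has_positive_cycle(nodes, edges, exec_time, hi):
        raise ValueError("graph contains a delay-free cycle (deadlock)")
    lo = 0.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if has_positive_cycle(nodes, edges, exec_time, mid):
            lo = mid                     # some cycle has ratio above mid
        else:
            hi = mid
    return hi


# Hypothetical IPC graph with edges given as (src, dst, delay).
exec_time = {"A": 2, "B": 3, "C": 1, "D": 4}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1),
         ("D", "D", 1), ("A", "D", 0), ("D", "C", 1)]
mcm = max_cycle_mean(list(exec_time), edges, exec_time)
print(round(mcm, 6))
```

In the example the cycle through A, B, and C has total execution time 6 and total delay 1, and it dominates the other two cycles (ratios 4 and 3.5), so the search converges to 6.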

2.5  Interconnect  Synthesis 

SoC  design  is  moving  toward  a  paradigm  where  reusable  components  called  IP  (for 
intellectual  property)  from  different  vendors  can  be  combined  to  rapidly  create  a  design. 
IP  interface  standards  are  being  developed  which  define  the  services  one  IP  component 
(or  IP  core)  is  capable  of  delivering,  and  which  enable  IP  cores  to  work  with  on-chip 
buses  and  other  interconnection  networks.  The  SoC  designer’s  task  is  then  to  choose 
the  appropriate  IP  cores,  map  the  application  tasks  onto  these  cores,  and  to  construct  a 
communication  network  and  corresponding  glue  logic  to  connect  these  IP  cores. 

Interconnect  synthesis  is  becoming  an  increasingly  important  part  of  system-level 
synthesis, given the larger number of blocks that must be interconnected and the increasing
importance of interconnect delay to overall performance. To date, the shared bus
has been the dominant interconnect. However, researchers are now exploring a richer
set  of  interconnection  schemes,  including  crossbars,  meshes,  and  other  point-to-point 
topologies.  We  will  explore  interconnect  synthesis  in  detail  in  Chapter  0. 

Chapter 3


System  Architectures  for  Multiprocessor 
Embedded  Systems 


There  has  been  substantial  research  work  in  the  areas  of  multiprocessor  hardware  and 
software  for  high  performance,  general  purpose  computing.  These  machines  tend  to  be 
big  and  expensive,  and  are  targeted  toward  solving  large  computational  problems  such 
as  climate  simulation.  As  mentioned  in  the  Introduction,  embedded  systems  can  also 
utilize  multiprocessor  architectures,  and  some  research  work  has  focused  on  developing 
application-specific  multiprocessor  systems.  Since  these  systems  only  need  to  support  a 
limited  number  of  programs,  it  is  often  possible  to  streamline  the  hardware  architecture. 
We  will  focus  on  systems  running  applications  that  can  be  described  by  dataflow  graphs. 
In  these  applications,  parallelism  is  easier  to  identify  and  exploit  because  much  more  is 
known  about  the  structure  of  the  computation. 

We  will  discuss  the  role  that  optical  interconnects  can  play  in  embedded  multi¬ 
processor  systems,  and  derive  some  fundamental  equations  relating  to  optically  con¬ 
nected  systems  on  chip.  We  will  introduce  three  architectures  on  which  a  broad  class  of 
high-throughput,  self-timed  DSP  applications  can  be  analyzed  accurately  using  efficient 
graph-theoretic  algorithms. 

3.1 Multiprocessor Program Execution Models


For  sequential  computers,  the  principal  execution  model  in  use  today  is  the  von  Neu¬ 
mann  model  which  consists  of  a  sequential  process  running  in  a  linear  address  space  [S5fl. 
In  1966,  Flynn  [BTIJ  proposed  a  simple  model  of  categorizing  multiprocessors  using  this 
execution  model  as  either  Single  Instruction  Multiple  Data  (SIMD)  or  Multiple  Instruc¬ 
tion  Multiple  Data  (MIMD)  according  to  how  they  partition  control  and  data  among 
different  processing  elements.  In  a  SIMD  machine  the  same  instruction  is  executed  by 
multiple  processors  using  different  data  streams.  Each  processor  has  its  own  data  mem¬ 
ory,  but  there  is  a  single  instruction  memory  and  control  processor.  In  a  MIMD  machine, 
each  processor  fetches  its  own  instructions  and  operates  on  its  own  data.  Using  this  ter¬ 
minology,  we  would  call  a  uniprocessor  a  single  instruction,  single  data  stream  (SISD) 
machine. MIMD machines fall into two categories: centralized shared-memory architectures
and distributed memory architectures. Figure 3.1 [55] depicts the basic structure
of  a  centralized  shared-memory  multiprocessor,  where  the  processors  and  memory  are 
connected  by  a  shared  bus.  Processors  communicate  by  writing  and  reading  from  loca¬ 
tions  in  memory.  In  order  to  reduce  the  memory  bandwidth  requirement  of  the  proces¬ 
sors,  memory  cache  is  used.  We  may  classify  the  data  in  the  multiprocessor  as  private 
data  if  it  is  only  used  by  a  single  processor,  or  shared  data  if  it  is  used  by  multiple 
processors. The communication mechanism utilizes shared data. When data is migrated
into a processor’s cache, the required bus bandwidth is reduced since this processor does not
need to access main memory to fetch the data. Also, memory access time to cache is
faster  than  to  main  memory.  When  the  data  is  private  data,  the  program  execution  is 
not  affected.  However,  when  shared  data  are  cached,  the  data  may  be  stored  in  multi¬ 
ple  caches.  This  complicates  the  program  execution,  since  there  must  be  some  way  to 
reconcile the different copies of the data. This problem is called cache coherence, and
has been well studied in general purpose computing [HQ. For some embedded systems
applications the cache is eliminated in order to reduce complexity and cost.

Figure 3.1: Basic structure of a centralized shared-memory multiprocessor. Multiple
processor-cache subsystems share the same physical memory, typically connected by a
bus.

Figure 3.2: Basic structure of a distributed-memory multiprocessor. Individual nodes
contain a processor, some memory, and an interface to an interconnection network that
connects all the nodes. Individual nodes may themselves contain a small number of
processors interconnected via a bus or other interconnect which is often less scalable
than the global interconnection network.

Figure 3.2 [K51J depicts a distributed-memory machine, which has a physically distributed
memory. These machines typically have larger processor counts, where a shared
bus  cannot  handle  the  required  communication  bandwidth.  Distributing  the  memory  re¬ 
duces  the  latency  for  access  to  the  local  memory.  Compared  to  the  shared-memory 
architecture,  communication  between  processors  is  more  complex. 

3.2  Architectures  Based  on  Dataflow 

In  the  von  Neumann  architecture,  all  the  data,  the  locations  of  the  data,  and  the  opera¬ 
tions  to  be  performed  on  the  data,  must  travel  between  memory  and  CPU  a  word  at  a 
time. This has been termed the “von Neumann bottleneck” [0J. Hardware architectures
based on dataflow have been studied in order to avoid this bottleneck. The dataflow
model of computation was discussed in Chapter 2. Dataflow models use dataflow program
graphs to represent the flow of data and control. In this model an instruction may
be executed (or fired) as soon as all its input operands are available. When an instruction
fires,  it  consumes  its  input  values  and  generates  some  output  values.  Because  of  this, 
the  dataflow  model  is  asynchronous.  In  a  dataflow  architecture  the  program  execution 
involves  receiving,  processing,  and  sending  out  tokens  containing  data  and  a  tag.  Depen¬ 
dencies  between  data  are  translated  into  tag  matching  and  transformation.  Processing 
occurs  when  a  set  of  matched  tokens  arrives  at  the  execution  unit.  The  matching  unit 
and  execution  unit  are  connected  by  queues.  Several  types  of  architectures  based  purely 
on  dataflow  have  been  studied  in  the  past.  They  differ  in  how  the  tokens  are  handled. 

The  single  token  per  arc  dataflow  architecture  was  proposed  by  Dennis  [B4IJ.  In  this 
architecture,  a  dataflow  graph  is  represented  by  a  number  of  activity  templates,  each 
containing  an  instruction  and  operand  slots  for  holding  operand  values.  Only  one  token 
is  allowed  at  a  time  on  an  arc.  Acknowledge  signals  are  used  to  enforce  the  single 
token  rule,  making  it  relatively  simple  to  detect  when  a  node  is  enabled.  The  MIT  Static 
Dataflow  Architecture  LI33I]  was  a  direct  implementation  of  this  model.  One  disadvantage 
of  this  architecture  is  that  consecutive  iterations  of  a  loop  can  only  partially  overlap  in 
time.  Another  is  the  additional  token  traffic  caused  by  the  acknowledgment  tokens. 

The  tagged-token  dataflow  model  [11071]  was  created  to  allow  loop  iterations  to  pro¬ 
ceed  in  parallel.  In  this  model,  each  token  contains  a  tag  that  defines  the  context  in  which 
the  data  value  will  be  used.  Multiple  tokens  are  allowed  on  an  arc.  A  node  is  enabled  as 
soon  as  tokens  with  identical  tags  are  present  on  each  of  its  input  arcs.  Several  groups 
produced  prototype  implementations  of  this  model  [O,  H,  11041].  A  disadvantage  of  this 
model is that it is difficult to implement an efficient unit to handle the overhead of token
matching. 
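
The enabling rule of the tagged-token model can be illustrated with a small software analogue of a matching unit. The Python sketch below is an assumption-laden illustration and does not describe any of the prototype machines: the class name `MatchingUnit`, the port numbering, and the dictionary-based waiting store are all hypothetical. Tokens addressed to the same node are held until one token with the same tag is present for every input arc.

```python
from collections import defaultdict


class MatchingUnit:
    """Holds tokens until a node has a full set of operands with equal tags."""

    def __init__(self, arity):
        self.arity = arity                    # node -> number of input arcs
        self.waiting = defaultdict(dict)      # (node, tag) -> {port: value}

    def receive(self, node, port, tag, value):
        """Store a token; return the operand list if the node becomes enabled."""
        slot = self.waiting[(node, tag)]
        slot[port] = value
        if len(slot) == self.arity[node]:
            del self.waiting[(node, tag)]     # matched tokens are consumed
            return [slot[p] for p in range(self.arity[node])]
        return None


unit = MatchingUnit({"add": 2})
print(unit.receive("add", 0, tag=1, value=10))  # waits for the partner token
print(unit.receive("add", 1, tag=2, value=5))   # different tag: no match yet
print(unit.receive("add", 1, tag=1, value=7))   # tag 1 complete: node fires
print(unit.receive("add", 0, tag=2, value=3))   # tag 2 complete: node fires
```

Tokens belonging to different loop iterations carry different tags, so they can be in flight on the same arc at once. The cost is the associative lookup on every token arrival, which is the matching overhead noted above.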

It was found that computers implementing pure dataflow performed poorly on sequential
code. This is due in part to the fine granularity (tasks correspond to operations
such as a simple multiply or compare) and to the overhead associated with token matching
for these tasks. One solution to this problem is to combine a dataflow architecture with
a von Neumann architecture. Most recent dataflow architectures employ a coarse-grain
model in which the computational task of each dataflow actor consists of a number of
instructions; the computation inside each task is executed on a von Neumann processor
(often a commercial off-the-shelf processor), and the actors communicate and synchronize
according to dataflow semantics. This is shown conceptually in Figure 3.3 [jQSIJ.


3.3  Architectures  Utilizing  Optical  Interconnects 

In  future  CMOS  chip  designs  incorporating  hundreds  of  millions  of  transistors,  the  wire 
interconnect  will  become  a  limiting  factor,  both  in  terms  of  area  overhead  and  delay.  Op¬ 
tical  interconnects  offer  the  potential  to  relieve  this  bottleneck.  In  this  section  we  will 
summarize  some  past  work  in  optical  interconnects  and  optically  connected  architec¬ 
tures,  and  introduce  two  new  architectures  we  have  developed  specifically  as  a  platform 
on  which  to  map  DSP  applications  described  as  dataflow  graphs. 

3.3.1  Optical  Interconnect  Technology 

Figure 3.3: Comparison of execution models: (a) von Neumann (control flow), (b) pure
dataflow, (c) coarse-grain model with dataflow graph and fully ordered grains.

In recent years, optics have played an increasing role in multiprocessor systems. Commercial
high-performance computers now use fiber ribbons to connect multiple processing
nodes [E3J. Other examples include storage area networks using Fibre Channel,
and  optical  clock  distribution  to  reduce  clock  skew  rate  across  a  chip  [E23J.  Optical 
technology  has  been  advancing  rapidly,  driven  in  large  part  by  the  optical  communi¬ 
cations  equipment  market.  Various  studies  have  predicted  that  the  energy  consumed 
by  data  communication  will  ultimately  limit  the  processing  speed  in  electronic  proces¬ 
sors  [f79L  100].  Light  signals  do  not  suffer  from  effects  such  as  electromagnetic  in¬ 
terference  and  capacitive  effects,  which  limit  electrical  interconnects.  While  transistor 
gate  delay  decreases  linearly  with  decreases  in  minimum  feature  size,  the  wire  delay 
increases  as  wires  become  thinner.  In  addition,  the  cross-sectional  area  of  metal  wires 
must  increase  with  length  to  maintain  acceptable  attenuation.  By  contrast,  an  optical 
channel has a constant transverse area of $\lambda^2$, where $\lambda$ is the wavelength of the light [S3].
Thus  beyond  a  certain  transmission  length,  optical  interconnections  become  favorable. 
This  break-even  length  is  estimated  to  be  between  0.1mm  and  1cm  [E3|. 

There  has  been  theoretical  work  [B7I]  that  has  established  that  arbitrary  connection 
graphs can be realized with an effective interconnection density of $1/\lambda^2$ using optics.
At these densities, heat removal will be the limiting factor [83].

Several  studies  03S,  E3]  have  addressed  the  question  of  what  is  the  best  size  for 
a  VLSI  processor  connected  by  optics.  These  have  concluded  that  the  system  should 
be partitioned into clusters of $10^4$ to $10^{12}$ transistors. This allows the design to reach
points  in  the  design  space  that  are  not  achievable  without  optics.  However,  there  may  be 
significant  power  and  space  costs.  If  size  and  power  are  the  primary  objectives,  optical 
systems  become  advantageous  only  with  extremely  large  systems  [83]. 

The main advantage of optics for a multiprocessor system is that it allows highly
parallel data links and a large degree of connectivity between processors.

3.3.2  Prototype Optically-Connected Systems


Several research groups have demonstrated optically-connected multiprocessor systems
(e.g., see [SB, 48, 76, IZZO). Some of these systems are based on free-space optical
interconnects, while others are based on wavelength division multiplexing (WDM). WDM
systems typically utilize fiber or waveguide interconnects, and are advantageous for
hybrid integration of independent modules. The strength of a free-space optical
interconnect scheme is its potential to provide an extremely high density of
interconnections, such as will be required for a single-chip system.

An example of a system utilizing free-space optical interconnects is the FAST-Net
prototype [US]. FAST-Net is a high throughput data switching concept that uses a
reflective optical system to globally interconnect a multichip array of processors. The
three-dimensional optical system links each chip directly to every other with a
dedicated bidirectional parallel data path. The system utilizes smart-pixel arrays (SPAs),
in which high density silicon electronics are integrated with two-dimensional arrays of
high speed Gallium Arsenide micro-laser/detector arrays. An array of SPAs is packaged
on a planar substrate and linked to itself through an optical system composed of a lens
array and a mirror. This concept provides internal bisection bandwidth [1711] on the order
of $10^{12}$ bits per second. Figure 3.4 depicts the SPA and the optical imaging system.

Compiler technology and automated mapping tools for these systems have received
relatively less attention than the hardware. Seo and Chatterjee [IQ4lj presented a CAD tool
for physical placement of modules in SoC utilizing optical interconnects. The tool
determined which interconnects should be routed electrically and which should be routed
optically. They reported a 50% reduction in worst case interconnect delay over using all
metallic interconnects.

Figure 3.4: FAST-Net prototype.

3.4  Optically Connected System on Chip


In  this  section  we  will  examine  several  fundamental  design  considerations  for  systems 
on  chip  utilizing  optical  interconnects.  Our  general  model  for  a  system-on-chip  (SoC) 
is  one  in  which  the  chip  is  partitioned  into  regions  that  are  connected  with  metallic 
(local)  interconnects,  and  these  local  regions  are  then  connected  through  optical  (global) 
interconnects  m.  As  mentioned  in  the  Introduction,  the  applications  we  consider  can 
be  modeled  by  dataflow,  and  consist  of  task  graphs,  where  the  individual  tasks  must  fit 
fully  into  a  local  region.  The  graph  vertices  (tasks  or  nodes)  in  the  acyclic  task  graphs 
represent  computations  while  the  edges  represent  the  communication  of  a  packet  of  data 
from  a  source  task  to  a  sink  task. 

Three  fundamental  design  considerations  for  such  a  system  are  addressed  in  this 
thesis: 

•  What  is  the  optimum  size  of  a  local  partition? 

•  What  techniques  should  we  use  to  map  and  schedule  tasks  on  these  partitions? 

•  How  do  we  synthesize  an  optimum  global  (optical)  interconnection  network  for 
the  system? 

These  considerations  are  interrelated,  since  the  size  of  the  local  partition  will  affect 
the  maximum  size  (granularity)  of  the  tasks,  and  the  scheduling  of  tasks  depends  on  the 
interconnection  network.  This  section  will  focus  on  the  question  of  optimal  partition 
size.  Scheduling  is  addressed  in  Chapter  ^  and  interconnect  synthesis  is  covered  in 
Chapter 0.

3.4.1  Global/Local Partitioning


This section presents an information-theoretical model for trade-offs in designing the
local partition of a SoC utilizing free-space optics. As mentioned earlier, free-space
optical interconnects can provide higher interconnect densities than other types of optical
interconnects. These trade-offs are fundamental in nature and will exist in any system
utilizing these interconnects.

These systems utilize arrays of vertical cavity surface emitting laser (VCSEL)
transmitters and photoreceivers to implement the interconnect. A single interconnect consists
of a VCSEL/photoreceiver pair. Light from the VCSEL must be directed to and imaged
on the appropriate photoreceiver. This is depicted for the FAST-Net system in
Figure 3.5. Different systems use different imaging methods to accomplish this. The high
density of interconnections arises from the use of the third dimension (free-space) and
the fact that overlapping optical signals do not interfere with each other (i.e., there is no
crosstalk in free space).

As the dimensions of the local partition decrease, higher f-number lenses are
required to collect the light from the transmitters in a constant focal-length system. (The
f-number of a lens is defined as its focal length divided by its diameter.) Figure 3.6
depicts the diffraction-limited images of an array of point sources, in a random on/off
pattern, on an array of photodetectors. The data for the figure was generated using
MATLAB to compute the diffraction pattern for F/1 lenses (left) and F/2 lenses (right). Using
an optical system with f-number $F$ and treating the transmitter as a point source
operating at wavelength $\lambda$, the diffraction-limited image of the source on the detector
is given by the expression

$$ I(\rho) = I_0 \left[ \frac{2 J_1(\pi \rho / \lambda F)}{\pi \rho / \lambda F} \right]^2 \tag{3.1} $$

where $\rho$ is the radius from the center of the image and $I_0$ is proportional to the
source intensity. The function $J_1$ is a first order Bessel function of the first kind.

about the mirror plane for the FAST-Net system.

Figure 3.6: An array of point sources imaged using f/1 optics (left) and f/2 optics (right).
The left and right pictures are different scales: the partitions on the left are twice the
length of the partitions on the right.
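The diffraction-limited spot can be evaluated numerically. The following is a minimal sketch, assuming the standard Airy pattern for a circular aperture, $I(\rho) = I_0 [2 J_1(u)/u]^2$ with $u = \pi\rho/(\lambda F)$; the function names and the 850 nm wavelength are illustrative assumptions, not values taken from the text. The Bessel function is computed from Bessel's integral so that only the standard library is needed.

```python
import math

def bessel_j1(x, steps=2000):
    """First-order Bessel function of the first kind, from Bessel's integral
    J1(x) = (1/pi) * integral_0^pi cos(theta - x*sin(theta)) d(theta),
    evaluated with the midpoint rule."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        total += math.cos(theta - x * math.sin(theta))
    return total * h / math.pi

def airy_intensity(rho, wavelength, f_number, i0=1.0):
    """Intensity at radius rho from the image center (assumed Airy pattern).
    i0 plays the role of the peak intensity, proportional to source power."""
    u = math.pi * rho / (wavelength * f_number)
    if abs(u) < 1e-9:
        return i0  # limit of [2*J1(u)/u]^2 as u -> 0
    return i0 * (2.0 * bessel_j1(u) / u) ** 2

if __name__ == "__main__":
    lam = 0.85e-6  # hypothetical VCSEL wavelength in meters
    for f_number in (1.0, 2.0):
        # Radius of the first dark ring, about 1.22 * lambda * F.
        first_null = 1.2197 * lam * f_number
        print(f_number, first_null, airy_intensity(first_null, lam, f_number))
```

Doubling the f-number doubles the radius of the first dark ring, which is why smaller apertures spill more light onto neighboring detectors.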

From this equation, the signal received by the center channel for this pattern can be
calculated by spatially integrating over the corresponding photodetector. This
calculation will also take into account the inter-pixel interference (IPI). We then vary the
pattern randomly to generate the conditional probability distributions for the center
channel. If we assume that the IPI is only significant between adjacent channels, we can use
the conditional probabilities to assess the mutual information corresponding to a channel
between partitions. As partition size decreases, and the associated aperture sizes decrease
(increasing the f-number), the optical signal intensity decreases and the IPI increases.
Both effects reduce the mutual information. We can then characterize the mutual
information as a function of partition size, and therefore, the number of partitions. The
mutual information between each source and its corresponding detector is given by


$$ I_{\text{mut}}(X;Y) = \sum_{i=0,1} \int p(y \mid X=i) \, \log_2 \frac{p(y \mid X=i)}{p(y)} \, dy \tag{3.2} $$

where $p(y \mid X=i)$ is the conditional probability that a value $y$ is received when $i$ is
transmitted and $p(y)$ is the probability density function (PDF) of $y$.

Figure 3.7: Trade-off between partition size, global data rate, and local data rate.
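To make the mutual-information calculation concrete, here is a small sketch under stated assumptions: the conditional densities are modeled as Gaussians (in the text they come from randomly varied on/off patterns, not from a closed form), each conditional term is weighted by an assumed prior of 1/2 for the transmitted bit, and the integral is approximated with a midpoint sum. All function and parameter names are hypothetical.

```python
import math

def gaussian_pdf(y, mean, sigma):
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mutual_information(mean0, mean1, sigma, prior1=0.5, steps=20000):
    """Mutual information (bits) between a binary input X and received value Y.
    Assumes p(y | X=i) is Gaussian with mean mean_i and standard deviation sigma."""
    priors = (1.0 - prior1, prior1)
    means = (mean0, mean1)
    lo = min(means) - 8.0 * sigma
    hi = max(means) + 8.0 * sigma
    dy = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        cond = [gaussian_pdf(y, m, sigma) for m in means]
        p_y = priors[0] * cond[0] + priors[1] * cond[1]  # marginal density of y
        for p_i, c in zip(priors, cond):
            if p_i * c > 0.0:  # skip terms that underflow to zero
                total += p_i * c * math.log2(c / p_y) * dy
    return total

if __name__ == "__main__":
    # More noise or IPI (larger sigma) lowers the mutual information per bit.
    for sigma in (0.05, 0.2, 0.5):
        print(sigma, mutual_information(0.0, 1.0, sigma))
```

With well-separated received levels the result approaches one bit per channel use, and it falls toward zero as the two conditional densities overlap.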

Restoring the mutual information required for the application can be achieved by
decreasing the bit rate and integrating over a longer clock cycle in order to increase the
signal-to-noise ratio. We define the information capacity, or data rate, as the product of
the mutual information and the bit rate. Therefore, it can be generally shown that
increasing the number of partitions on a chip will lead to lower global data rate across the
chip. At the same time, smaller partitions will reduce the length requirements on local
interconnections (intra-partition) performed electrically. Therefore, local interconnect
data rates can benefit from reduced partition size. We assume that the data rate is
inversely proportional to the RC time constant, which in turn is proportional to the square
of the interconnection length. A simple approximation then results in a factor $\sqrt{N}$
decrease in local interconnect length, and therefore a factor $N$ increase in the local data
rate, where $N$ is the number of partitions. These opposing effects of partition size suggest
a trade-off between the local and global data rates, which is illustrated hypothetically in
Figure 3.7, and thus an optimum partitioning of the SoC. This is the crossing point of
the two curves in Figure 3.7.
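The scaling argument for the local data rate can be written out step by step. This is a sketch under the assumption of a square chip of side $L$ divided into $N$ equal square partitions; the symbols $\ell$ (local interconnect length), $\tau_{RC}$, and $R_{\text{local}}$ are introduced here only for illustration.

```latex
\begin{align*}
\ell &\propto \frac{L}{\sqrt{N}}
  && \text{(partition side length shrinks by } \sqrt{N}\text{)} \\
\tau_{RC} &\propto \ell^{2} \propto \frac{L^{2}}{N}
  && \text{(RC time constant scales with length squared)} \\
R_{\text{local}} &\propto \frac{1}{\tau_{RC}} \propto \frac{N}{L^{2}}
  && \text{(local data rate grows by a factor } N\text{)}
\end{align*}
```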

3.4.2  Typical Numbers

We next give some estimates of system parameters based on today's components. The
optical channel density on the chip will impose a fundamental upper limit on the number
of partitions, $M$, for the SoC. For a chip with dimensions $L \times L$, the number of optical
channels $N$ will be given by $N \le L^2 / 2d^2$, where $d$ is the VCSEL and detector pitch.
For a full crossbar connection, $N = M(M-1)$. For a "typical" VCSEL pitch of 125
microns, this implies that we would be limited to 57 partitions for a one square
centimeter chip. The power requirements depend on the architecture, but some insight can be
gained by considering examples. Let $P_0$ represent the power required to drive a
VCSEL-detector pair. If every partition is transmitting and receiving data, the total optical
power is given by the number of partitions times the number of VCSEL-detector pairs
transmitting per partition times $P_0$. The upper limit of power consumption corresponds to
the case in which all VCSEL-detector pairs are operating. Therefore, $P \le (L^2 / 2d^2) P_0$.
The lower limit to the power consumption corresponds to the case where only one pair per
partition is transmitting at any instant of time, which implies $P \ge M P_0$. If we assume
$P_0 = 10$ mW and 57 partitions, then the total power consumption would be 32 W for
the one square centimeter chip in the most demanding case and 570 mW for the least
demanding case.

The one-way data rate between two partitions is given by the data rate per
VCSEL-detector pair, $D_0$, times the number of pairs:
$D_{\text{partition}} = \frac{L^2}{2 d^2 M(M-1)} D_0$. For $D_0 = 2.5$ Gbps,
$D_{\text{partition}} = 4$ Tbps in a two-partition architecture. In the case of a single
VCSEL-detector pair per cluster, the partition data rate is equal to the channel data rate
at 2.5 Gbps, with an aggregate data rate of 142.5 Gbps for 57 partitions.
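The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses the stated values (1 cm chip side, 125 micron pitch, 10 mW and 2.5 Gbps per VCSEL-detector pair); the function and variable names are illustrative only.

```python
def max_channels(chip_side_um, pitch_um):
    """Upper bound on optical channels, N <= L^2 / (2 d^2)."""
    return chip_side_um ** 2 // (2 * pitch_um ** 2)

def max_partitions(channels):
    """Largest M with M(M - 1) <= channels (full crossbar)."""
    m = 1
    while (m + 1) * m <= channels:
        m += 1
    return m

def partition_rate_gbps(channels, partitions, d0_gbps):
    """One-way data rate between two partitions when all channels are shared
    evenly among the M(M - 1) ordered partition pairs."""
    return channels / (partitions * (partitions - 1)) * d0_gbps

channels = max_channels(10_000, 125)      # 1 cm = 10,000 microns
partitions = max_partitions(channels)
p0_watts = 0.010                          # power per VCSEL-detector pair
d0_gbps = 2.5                             # data rate per VCSEL-detector pair

power_max_w = channels * p0_watts         # every pair operating
power_min_w = partitions * p0_watts       # one pair per partition
two_partition_rate = partition_rate_gbps(channels, 2, d0_gbps)
aggregate_single_pair = partitions * d0_gbps

if __name__ == "__main__":
    print(channels, partitions, power_max_w, power_min_w,
          two_partition_rate, aggregate_single_pair)
```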

3.5  Modeling Optically-Interconnected Systems with Synchronization Graphs

A  graph-theoretic  framework,  called  the  synchronization  graph,  for  analyzing  arbitrary 
algorithm-to-architecture  mappings  is  given  in  [11  Oil].  In  this  section  we  describe  three 
architectures  we  developed  to  take  full  advantage  of  the  analytical  properties  of  this 
framework.  The  synchronization  graph  applies  to  any  hardware  architectural  model  that 
includes  the  following  assumptions: 

•  For each computational task (dataflow node), a reasonably accurate estimate
exists for the execution time of the task, and this execution time exhibits little or no
variation with input data.

•  Once a communication link is reserved for a specific data packet, the link remains
reserved exclusively for that packet until transfer of the packet completes.

•  The transit time of data packets through the interconnection network, once a
communication link has been reserved for the transfer, is deterministic.

If we assume that the time required to perform interprocessor communication is zero,
then the synchronization graph work shows that the throughput of a given
algorithm-to-architecture mapping can be determined accurately by an efficient graph-theoretic
technique [013]. If the interprocessor communication time is nonzero, the technique gives an
upper bound to the throughput. The tightness of this upper bound depends on the ratio of
interprocessor communication time to average task execution time. In optically-connected
multiprocessor systems, we can expect that this ratio is small. This is a particularly good
assumption if the implementation guarantees that there is never contention among the
processors for an optical link. The existence of an accurate, efficient throughput
analysis technique opens up the possibility of developing improved algorithm-to-architecture
mapping techniques, which we explore in this thesis.

3.5.1  SLOT  Architecture 

We developed an architecture, which we call SLOT (Self-timed Locally Ordered
Transaction), on which a broad class of high-throughput, self-timed DSP applications can
be analyzed accurately using efficient algorithms based on the synchronization graph
framework. SLOT enables the development of powerful tools for automatic
application mapping (compilation). The interprocessor connectivity requirements of SLOT are
large, and thus optical interconnect technology appears to be a natural match for SLOT
systems. In particular, a general-purpose SLOT architecture requires that each processor
have a dedicated communication channel for each processor with which it
communicates. Figure 3.8 gives a graphical representation of this architecture.

SLOT architectures can be composed of arbitrary, possibly heterogeneous,
collections of processing elements, such as DSP processors, FPGA or ASIC subsystems,
microprocessors, and microcontrollers. When a processor is embedded within a SLOT
architecture, one or more communication processors are used to interface the processor
to the rest of the multiprocessor system. Each communication processor is assigned a
pre-defined ordering of the interprocessor communication operations (send and receive
operations to and from other processors) that are required to interface the associated
(computation) processor. These local orderings of communication operations, on the
communication processors within a SLOT system, are repeated over and over again
based on the arrival of data (from the associated computation processors, or from other
communication processors). A group of communication processors can also be
"clustered together" without an associated computation processor. Such clusters of
communication processors serve as routers that provide additional communication paths
between remote computation processors. Figure 3.8 depicts one computation and
communication processor in a four-processor, fully connected system. A dedicated laser
transmitter is required for each other processor to which this processor must send data.
Also, a dedicated photodiode receiver and buffer memory is required for each processor
from which the processor receives data. With this architecture, there is no contention for
communication resources, and the synchronization graph models the system accurately.
This architecture is particularly well suited to be implemented in free-space optical
systems such as FAST-Net. As mentioned above, one advantage of free space interconnects
is the high density of interconnects that can be achieved. If we were to replace the
processing element in the FAST-Net prototype with the combination of multiplexers,
communication processor, and computation processor from Figure 3.8, SLOT could be
implemented using the FAST-Net optical imaging, packaging, and smart pixel array hardware.

3.5.2  Dedicated  Channel  Fiber  WDM  Architecture 

One disadvantage of free-space optical systems is that they are very sensitive to
alignment. The alignment of the optical paths described in Section 3.4.1 from each laser
transmitter to the correct photodiode receiver may be difficult and not robust under some
operating environments (due to vibration, temperature changes, etc.). Fiber-based
architectures do not suffer from this problem: the VCSEL-to-fiber and fiber-to-photodiode
interface has proven to be very robust in commercial systems. Here we describe a
fiber-based implementation of SLOT.

In this implementation, we need to assign a unique wavelength to each
communication channel. We define the processor graph $G_p$ as a directed graph in which the
nodes represent the processors in the system and the edges represent connections
between processors: if processor $i$ transmits data to processor $j$, there is an edge $(i,j)$ in
$G_p$. This is essentially the WDM equivalent to the free space interconnect since there is
a dedicated channel between every pair of processors. Physical constraints will usually
place limits on the fan-out and fan-in of the processors. Fan-out of a processor $p$ is
defined as the out-degree of node $p$ in $G_p$, while fan-in is defined as the in-degree of node
$p$ in $G_p$. We will define the maximum allowed fan-out as $f_{out}$ and the maximum allowed
fan-in as $f_{in}$. Figure 3.9 depicts this implementation.

Figure 3.9: Architecture for contention-free fiber-based SLOT.
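The processor graph and its fan-out and fan-in limits can be checked mechanically. The sketch below is illustrative only: it assumes the graph is given as a list of directed edges `(i, j)` and that one wavelength is assigned per edge, as described above; the function name and the report layout are hypothetical.

```python
from collections import Counter

def degree_report(edges, f_out, f_in):
    """Summarize a processor graph given as directed edges (i, j).

    Returns the number of dedicated channels (one wavelength per edge),
    the out-degree and in-degree of each processor, and the processors
    that exceed the allowed fan-out f_out or fan-in f_in."""
    unique_edges = set(edges)
    out_deg = Counter(src for src, _ in unique_edges)
    in_deg = Counter(dst for _, dst in unique_edges)
    violations = sorted(
        {p for p, d in out_deg.items() if d > f_out}
        | {p for p, d in in_deg.items() if d > f_in}
    )
    return {
        "wavelengths": len(unique_edges),
        "fan_out": dict(out_deg),
        "fan_in": dict(in_deg),
        "violations": violations,
    }

if __name__ == "__main__":
    # Processor 0 sends to 1 and 2, processor 1 sends to 2, processor 2 sends to 0.
    example = [(0, 1), (0, 2), (1, 2), (2, 0)]
    print(degree_report(example, f_out=2, f_in=1))
```

In the example, processor 2 receives from two processors, so it violates a fan-in limit of one even though every fan-out is within bounds.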

The advantage of this implementation is that there are no central controllers required.
This architecture allows a direct implementation of the synchronization graph. A
disadvantage is the number of wavelengths required: for a system with $n$ processors, there
are $n^2$ wavelengths required, or $n f_{out}$ wavelengths required if we place a constraint on
the fan-out.

3.5.3  One  Wavelength  Per  Processor 

In order to reduce the number of wavelengths required, we can implement a protocol in
which each processor is assigned a unique wavelength. In this system, we must ensure
that two processors do not send to a given processor at the same time; i.e., there is
possible contention at the (single) receiver of every processor. In order to accomplish
this, we introduce a controller for every wavelength. This controller grants access to
only one processor at a time. Figure 3.10 depicts this implementation.

Figure 3.10: Architecture for wavelength ordered transactions.

In this architecture, processor $m$ receives data on its uniquely assigned wavelength
$\lambda_m$. In order to grant $\lambda_m$ to processor $q$, the controller for $\lambda_m$ sends the number $q$ on
its grant output (which is at wavelength $\lambda_m$). The controller has an acknowledgment
(ACK) receiver for every processor to which it grants access. The communication
processors must wait to be granted access to a particular wavelength before transmitting
on that wavelength. The number of grant lines entering a communication processor is
equal to $f_{out}$, since there must be a grant for each $\lambda$, and a processor transmits on a
maximum of $f_{out}$ wavelengths. When a processor $p$ has completed transmission on a given
wavelength, say $\lambda_r$, it sends an acknowledgment consisting of the number $r$ on
wavelength $\lambda_p$. The number of ACK lines entering a wavelength controller $x$ is equal to the
fan-out $f_{out}$ for processor $x$. One advantage of this architecture is that it requires fewer
wavelengths: $n$ wavelengths are required as opposed to $n^2$ or $n f_{out}$. One disadvantage
is that it is more complicated and $n$ controllers are required. Also the throughput may be
lower since the system is more constrained: we have the same synchronization graph
as before with extra edges added for the grants and acknowledgments. We refer to this
architecture as wavelength division multiplexing ordered transactions (WDMOT).
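The grant and acknowledgment handshake can be illustrated with a small state machine. This is a hedged sketch, not the controller design from the thesis: it assumes each controller steps through a fixed, repeating order of senders (suggested by the name "ordered transactions"), and the class and method names are hypothetical.

```python
class WavelengthController:
    """Sketch of the controller for one receiver wavelength lambda_m.

    It grants the wavelength to one sender at a time and waits for that
    sender's ACK before issuing the next grant."""

    def __init__(self, owner, grant_order):
        self.owner = owner                    # processor m that receives on lambda_m
        self.grant_order = list(grant_order)  # assumed fixed, repeating sender order
        self.position = 0
        self.holder = None                    # sender currently holding the grant

    def grant(self):
        """Issue the next grant; returns the number q sent on the grant output."""
        if self.holder is not None:
            raise RuntimeError("wavelength is still in use")
        self.holder = self.grant_order[self.position]
        return self.holder

    def acknowledge(self, sender, wavelength_owner):
        """Process the ACK from `sender`, which carries the number of the
        wavelength owner it has finished transmitting to."""
        if sender != self.holder or wavelength_owner != self.owner:
            raise ValueError("unexpected acknowledgment")
        self.holder = None
        self.position = (self.position + 1) % len(self.grant_order)

if __name__ == "__main__":
    ctrl = WavelengthController(owner=2, grant_order=[0, 1])
    q = ctrl.grant()            # processor 0 may now transmit on lambda_2
    ctrl.acknowledge(q, 2)      # processor 0 reports completion
    print(ctrl.grant())         # next grant goes to processor 1
```

Because a second grant is refused until the ACK arrives, two processors can never transmit to the same receiver at once, at the cost of the extra grant and acknowledgment dependencies noted above.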

We will examine the theoretical performance of the three architectures described
above in Chapter 5.

Chapter 4

Contention Analysis in Shared Bus Systems Utilizing the Period Graph

4.1  Contention  in  Shared  Bus  Systems 

In many practical multiprocessor systems, there is contention for one or more shared
communication resources. One example of this is a shared bus, in which the processors
must first gain access to the bus before they can execute an interprocessor
communication (IPC) operation. Figure 4.1 depicts a simple architecture with three processors,
a shared memory, and a shared bus. One consequence of this contention is that under
self-timed, iterative execution, there is no known method for deriving an analytical
expression for the throughput of the system [11011], and thus, simulation is required to get a
clear picture of application performance. However, simulation is computationally very
expensive, and it is highly undesirable to perform simulation inside the innermost
optimization loop during synthesis. To avoid such a simulation, an accurate and efficient
estimator for throughput is required. In this chapter we will present an efficient
estimator for the throughput of these systems when operating in a self-timed, iterative manner.
As explained in Chapter 2, in self-timed execution the assignment of tasks to processors
and the execution ordering of tasks on each processor are determined at compile-time,
and at run-time, processors synchronize with one another only based on inter-processor
communication requirements, and do not necessarily synchronize at the end of each loop
iteration.

Figure 4.1: Schematic of a three processor shared bus architecture.

If contention is resolved deterministically, and execution times are constant, then
self-timed evolution may lead to an initial transient state, but the execution will
eventually become periodic. This holds because the multiprocessor may be modeled as a
finite-state system, and thus, aperiodic behavior, which implies the presence of
infinitely many states, cannot hold. In DSP systems, although execution times are not
always constant or known precisely, they typically adhere closely to their respective
estimates with high frequency. Under such conditions, the periodic execution pattern
obtained from the estimated execution times provides an estimate of overall system
throughput based on the task-level estimates. The estimates for task execution times
can be obtained through several methods. The most straightforward is for the
programmer to provide them while developing a library of primitive blocks, as is done in the
Ptolemy system [1211]. Analytical techniques also exist. Li and Malik [B2 Ij have proposed
algorithms for estimating the execution time of embedded software tasks in an efficient
manner. Due to the largely deterministic nature of DSP applications, such system-level
performance analysis and optimization based on task-level estimates is common
practice in the DSP design community [1251, 55, 50, BEQ.

For self-timed systems, when we apply execution time estimates to estimate
overall throughput, it is necessary to simulate (using the execution time estimates) past the
transient state until a periodic execution pattern (steady state) emerges. Unfortunately,
the duration of the transient may be exponential in the size of the application
specification [11011], and this makes simulation-intensive, iterative synthesis highly unattractive.

We introduced the novel period graph model [H] in order to greatly reduce the rate
at which simulation must be carried out during iterative synthesis. Given an
assignment $v$ of task execution times, and a self-timed schedule, the associated period graph is
constructed from the periodic, steady-state pattern of the resulting simulation. The
maximum cycle mean (MCM) of the period graph (with certain adjustments) is then used
as a computationally-efficient means of estimating the iteration period (the reciprocal of
the throughput) as changes are explored within a neighborhood of $v$. In this context, the
MCM is the maximum over all directed cycles of the sum of the task execution times
divided by the sum of the edge delays. The MCM can be computed in low polynomial
time [E0D-
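To illustrate the MCM definition, the following sketch enumerates the simple directed cycles of a small graph and returns the largest ratio of total execution time to total edge delay. It is a brute-force illustration for small examples, not the low-polynomial-time algorithm referred to above; the data layout (a dictionary of execution times and a list of `(source, sink, delay)` edges) is an assumption made for the example.

```python
def max_cycle_mean(exec_time, edges):
    """Maximum over simple directed cycles of
    (sum of node execution times) / (sum of edge delays).

    exec_time: dict mapping node -> execution time
    edges: iterable of (source, sink, delay) tuples
    Returns None if the graph has no directed cycle."""
    nodes = sorted(exec_time)
    order = {v: i for i, v in enumerate(nodes)}
    succ = {v: [] for v in nodes}
    for u, v, delay in edges:
        succ[u].append((v, delay))
    best = None

    def dfs(start, node, time_sum, delay_sum, on_path):
        nonlocal best
        for nxt, delay in succ[node]:
            if nxt == start:
                total_delay = delay_sum + delay
                if total_delay == 0:
                    raise ValueError("cycle with zero total delay")
                mean = time_sum / total_delay
                if best is None or mean > best:
                    best = mean
            elif order[nxt] > order[start] and nxt not in on_path:
                # Only visit nodes ordered after `start`, so each cycle is
                # enumerated once, from its lowest-ordered node.
                on_path.add(nxt)
                dfs(start, nxt, time_sum + exec_time[nxt],
                    delay_sum + delay, on_path)
                on_path.remove(nxt)

    for s in nodes:
        dfs(s, s, exec_time[s], 0, {s})
    return best

if __name__ == "__main__":
    times = {"A": 2, "B": 3, "C": 1}
    graph = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "A", 2)]
    print(max_cycle_mean(times, graph))
```

In the example, the cycle A, B has execution time 5 and one delay, while the cycle A, B, C has execution time 6 and two delays, so the first cycle determines the estimate of the iteration period.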

4.2  Constructing  the  Period  Graph 

The first step in the construction of the period graph is the identification of the period
from the simulator output. This can be performed by tracing backward through the
simulation and searching for the latest intermediate time instant $t_a$ at which the system
state $S(t_a)$ equals the state $S(t_f)$ obtained at the end of the simulation (here, $t_f$ denotes
the simulation time limit). If no match is found, then the end of the first period exceeds
$t_f$, and thus, the simulation needs to be extended beyond $t_f$. Otherwise, the region (often
depicted as a Gantt chart) that spans the interval $[t_a, t_f]$ constitutes a (minimal) period
of the simulated steady state.
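The backward search for the period can be sketched in a few lines. This is a simplified illustration that assumes the simulator records a hashable snapshot of the system state $S(t)$ at each sample instant; the function name and the list-of-snapshots representation are hypothetical.

```python
def find_period(states):
    """Search backward for the latest earlier state equal to the final state.

    states[k] is a hashable snapshot of the system state S(t_k); the last
    entry is S(t_f). Returns (a, f), the indices bounding one period, or
    None if the simulation must be extended."""
    final_index = len(states) - 1
    final_state = states[final_index]
    for k in range(final_index - 1, -1, -1):
        if states[k] == final_state:
            return k, final_index
    return None

if __name__ == "__main__":
    # Two transient states followed by a repeating pattern of length two.
    trace = ["init", "warmup", "s1", "s2", "s1", "s2"]
    print(find_period(trace))
```

Returning `None` corresponds to the case in which the end of the first period lies beyond the simulation time limit.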

Here, the system state $S(t)$ contains the execution state of each processor, which
is either "idle" or representable by an ordered pair $(A, \tau)$, where $A$ is the task being
executed at time $t$, and $\tau$ denotes the time remaining until the current invocation of $A$
is completed. The state $S(t)$ also contains the current buffer sizes of all IPC buffers,
as well as any information (e.g., request queue status) that is used by the protocol for
resolution of communication contention. Our approach to efficiently determining the
period is as follows:

•  Perform a simulation of the schedule for some time $T_{sim}$. Define a constant $C$,
which is an initial estimate for the number of complete cycles (invocations) of the
graph that must be simulated in order to find a period; this constant represents the
length of the initial transient, before the output becomes periodic. If this initial
estimate is too low, it will be increased during the algorithm. Let $N$ be the number
of processors, and let $n_j$ be the number of tasks scheduled on processor $j$, where
$j \in [1, N]$. Tasks include IPC tasks as well as computational tasks. Label these
tasks $V_{1,j}, V_{2,j}, \ldots, V_{n_j,j}$. We consider the case where the system executes these
tasks infinitely. The invocation number of a task is defined as the number of times a
given task has executed, and is denoted with a superscript. For example, $V_{a,j}^{b}$
denotes the $b$th invocation of task $a$ on processor $j$. Define a simulation array
for each processor, $Sim_j[i]$, where $i \in [1, M_j]$ and $M_j$ is the number of tasks on
processor $j$ that were output by the simulator. The elements of the simulation
array are the tasks, and are ordered by reverse start time, so that
$Start(Sim_j[i]) > Start(Sim_j[i+1])$.

•  Create two idle vectors of length $n_j$ for each processor spanning one invocation.
Label the first idle vector $Idle1_j[k]$ where $k \in [1, n_j]$. Label the second idle vector
$Idle2_j[k]$.

•  Examine the IPC buffer vector at some fixed point of each idle vector. The IPC
buffer vector consists of the numbers of tokens queued on all the IPC edges of
the graph enumerated in some order. The IPC buffer vector must be output by the
simulator at least once every graph iteration. For example, the simulator could
output an IPC buffer vector for each processor every time the processor executes
the first task scheduled on it. In this way, each idle vector would be associated with
one IPC buffer vector. Label these vectors $IPCBuf1_j[q]$ and $IPCBuf2_j[q]$ where
$q \in [1, E]$ and $E$ is the number of edges in the IPC graph. The IPC buffer vector
represents the state of the communication buffers in the system. Let $Tokens(e, t)$
be the number of data tokens on edge $e$ at time $t$. Let $TaskNum_j(t)$ be the number
of the node that is executing on processor $j$ at time $t$. Pseudo-code from [HO] for
constructing the period graph is shown in Figure 4.2.

Our  experience  suggests  that  in  practice,  most  graphs  have  periods  spanning  only  a 
few  invocations,  so  the  above  procedure  for  finding  the  period  is  efficient.  For  a  system 
with  a  period  that  spans  N  invocations  and  with  at  most  L  tasks  per  processor,  this 
method  requires  LN(N  +  1)  comparisons. 
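
The comparison at the core of this procedure can be sketched as follows. This is only a minimal illustration, not the simulator-based implementation: it assumes the simulator output has already been reduced to a per-processor list of idle gaps and a list of IPC buffer snapshots taken once per invocation, both ordered most recent first. The names `find_period_span`, `idle`, `tasks_per_invocation`, `ipc_buf`, and `max_span` are hypothetical.

```python
from typing import Optional, Sequence


def find_period_span(idle: Sequence[Sequence[float]],
                     tasks_per_invocation: Sequence[int],
                     ipc_buf: Sequence[Sequence[int]],
                     max_span: int) -> Optional[int]:
    """Return the smallest span (in invocations) over which the idle vectors
    of every processor and the IPC buffer vector repeat, or None if the
    recorded trace is too short to decide."""
    for span in range(1, max_span + 1):
        if span >= len(ipc_buf):
            return None  # not enough buffer snapshots were recorded
        repeats = list(ipc_buf[0]) == list(ipc_buf[span])
        for gaps, n_j in zip(idle, tasks_per_invocation):
            width = span * n_j
            if 2 * width > len(gaps):
                return None  # trace too short: simulate longer and retry
            # Compare the most recent idle vector with the one before it.
            if list(gaps[:width]) != list(gaps[width:2 * width]):
                repeats = False
        if repeats:
            return span
    return None


# Hypothetical trace: one processor, two tasks per invocation, whose idle
# pattern repeats every two invocations.
idle = [[1, 0, 2, 0, 1, 0, 2, 0]]
ipc_buf = [[1, 0], [0, 1], [1, 0], [0, 1]]
span = find_period_span(idle, [2], ipc_buf, max_span=3)
```

Exact equality is used here because the simulated times are deterministic; a trace with measured times would need a tolerance instead.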

Figure 4.3(a) illustrates an application graph, Figure 4.3(b) illustrates a self-timed
schedule, Figure 4.3(c) shows the periodic steady state that results from the schedule
and execution time estimates, and Figure 4.3(d) depicts the resulting period graph. The
shaded nodes in Figure 4.3(d) correspond to idle time ranges in the period, and solid
black circles on edges represent delays, which model inter-iteration dependencies. Note
that the steady state period may span multiple graph iterations (two in this example),
and in the period graph, this translates to multiple instances of each application graph
task.

Algorithm 4.1: CalculatePeriod(C, Tsim)

minc ← 0
while minc < C
    do INCREMENT(Tsim)
       SIMULATE(Tsim)
       minc ← min_j (L_j)
       for t ← 0 to Tsim
           do for j ← 1 to N
                  do a_j ← TaskNum_j(t)
                     invocation_j[a_j] ← invocation_j[a_j] + 1
                     b_j ← invocation_j[a_j]
                     if TaskNum_j(t) > TaskNum_j(t - 1)
                         then Sim_j[i] ← v_{b_j, a_j}
span ← 0
repeat
    span ← span + 1
    for k ← 1 to span * n_1
        do for j ← 1 to N
               do if span * n_j > M_j
                      then comment: error: increase C and start over
                  Idle1_j[k] ← Finish(Sim_j[k]) - Start(Sim_j[k + 1])
                  Idle2_j[k] ← Finish(Sim_j[span * n_j + k]) - Start(Sim_j[span * n_j + 1 + k])
    for q ← 1 to E
        do IPCBuf1[q] ← Tokens(q, Start(Sim_1[1]))
           IPCBuf2[q] ← Tokens(q, Start(Sim_1[span * n_1 + 1]))
until ∏_j (Idle1_j = Idle2_j) = 1 and (IPCBuf1 = IPCBuf2)

Figure 4.2: Pseudocode for constructing the period graph.

Figure 4.3: An illustration of the period graph construct.

For  clarity  in  this  illustration,  we  have  assumed  negligible  latency  associated  with 
IPC.  As  described  below,  non-negligible  IPC  costs  can  easily  be  accommodated  in  the 
period  graph  model  by  introducing  send  and  receive  tasks  at  appropriate  points. 

As illustrated in Figure 4.3, the period graph consists of all the tasks comprising the
period that was detected, with the idle time ranges between tasks (including those that
are caused by communication contention) also treated as nodes in the graph. The nodes
are connected by edges in the order that they appear in the period. An edge is placed
from the last node in the period for each processor to the first node in the period. This
edge is given a delay value of one (to model the associated transition between period
iterations), while all of the other intraprocessor edges have delay values of zero. This is
done for all the processors in the system. Our model utilizes send and receive nodes for
IPC as described above. For each IPC point, a send node is placed on the processor that
is sending data, and a corresponding receive node is placed on the processor that will
receive the data. The period graph is completed by adding an edge from each send node
to its corresponding receive node.
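
The construction just described, together with the maximum cycle mean (MCM) that is later used as the throughput estimate, can be sketched as follows. This is a minimal illustration under stated assumptions: node names, execution times, and the helpers `build_period_graph` and `max_cycle_mean` are hypothetical, idle ranges are treated as ordinary nodes, and the MCM is found by brute-force enumeration of simple cycles, which is only practical for small graphs.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source, sink, delay)


def build_period_graph(proc_sequences: List[List[str]],
                       ipc_pairs: List[Tuple[str, str]]) -> List[Edge]:
    """Chain the nodes of each processor in period order, close each chain
    with a delay-1 edge, and add a delay-0 edge per send/receive pair."""
    edges: List[Edge] = []
    for seq in proc_sequences:
        for u, v in zip(seq, seq[1:]):
            edges.append((u, v, 0))
        edges.append((seq[-1], seq[0], 1))  # transition to the next period
    for send, recv in ipc_pairs:
        edges.append((send, recv, 0))
    return edges


def max_cycle_mean(exec_time: Dict[str, float], edges: List[Edge]) -> float:
    """Maximum over simple cycles of (sum of node times) / (sum of delays)."""
    succ: Dict[str, List[Tuple[str, int]]] = {v: [] for v in exec_time}
    for u, v, d in edges:
        succ[u].append((v, d))
    order = {v: i for i, v in enumerate(exec_time)}
    best = 0.0

    def dfs(start: str, node: str, time: float, delay: int, on_path: set) -> None:
        nonlocal best
        for nxt, d in succ[node]:
            if nxt == start:
                if delay + d == 0:
                    raise ValueError("zero-delay cycle: the graph deadlocks")
                best = max(best, time / (delay + d))
            elif nxt not in on_path and order[nxt] > order[start]:
                # Each cycle is explored only from its lowest-ordered node.
                dfs(start, nxt, time + exec_time[nxt], delay + d, on_path | {nxt})

    for s in exec_time:
        dfs(s, s, exec_time[s], 0, {s})
    return best


# Hypothetical two-processor period with one idle range on the second processor.
exec_time = {"A": 3, "s1": 1, "r2": 1, "B": 2, "idle": 2, "r1": 1, "C": 4, "s2": 1}
edges = build_period_graph([["A", "s1", "r2", "B"], ["idle", "r1", "C", "s2"]],
                           [("s1", "r1"), ("s2", "r2")])
mcm = max_cycle_mean(exec_time, edges)  # estimated period; throughput is 1 / mcm
```

In this example the critical cycle passes through both processors via the two send/receive edges, so the estimated period exceeds the busy time of either processor alone.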
4.3 Fidelity of the Estimator

As mentioned above, the period graph can be used to estimate the system throughput of
a given self-timed schedule as the task execution times are varied. In order to make a
concrete example, we will examine voltage scaling. Some processors have the ability
to alter their execution voltage while in operation. This allows the processor to operate
at an optimal energy/efficiency point. When the voltage on a processor is varied, the
execution time of a computational task varies according to


$$ t = \frac{k V_{dd}}{(V_{dd} - V_t)^2} $$    (4.1)


where $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a constant [f23j.
We use a value of 0.8 volts for the threshold voltage. The execution time $p_{e,i}$ of each of
these states in the original (non-scaled) period graph is referenced to a voltage $V_{ref}$. The
change in execution time of each computational node is found by taking the derivative:


where $V_{sc}$ is the new voltage. It is not obvious, however, how one should adjust the
idle times in the period graph. We separate the idle nodes into two sets: contention
idles and data idles. When a node has the necessary data to execute (the necessary data
has already been produced), but is idle waiting for access to the bus, the associated idle
node is classified as a contention idle. When a node is idle waiting for its predecessors’
data, the associated idle node is classified as a data idle. By experimenting with a large
number of application graphs, we found that we could capture the effects of contention
and obtain the best fidelity by zeroing out the data idles and leaving the contention idles
constant as the computation nodes are scaled. Using these rules, the fidelity is calculated
as follows:
• Given an application graph, construct a valid schedule. We used the dynamic
level scheduling algorithm given by Sih and Lee [FTTiJ. Next, construct the period
graph as discussed earlier. Generate $N$ voltage vectors (assignments of voltages
to the processors in the target architecture). For each voltage vector, perform
a simulation to determine the throughput, with the execution times of the tasks
on each processor given by (4.1) according to the voltage on the processor. Also,
obtain an estimate for the throughput by calculating the MCM of the voltage-scaled
period graph, in which the execution times of the computation nodes are
given by (4.1) and the execution times of the idle nodes are as explained above.


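• Calculate the fidelity according to:

$$ Fidelity = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} f_{ij} $$    (4.2)

where

$$ f_{ij} = \begin{cases} 1 & \text{if } sign(S_i - S_j) = sign(M_i - M_j) \\ 0 & \text{otherwise} \end{cases} $$

$$ sign(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases} $$

The $S_i$ denote the simulated throughput values, and the $M_i$ are the corresponding
estimates from the period graph.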
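
Equation (4.2) counts the fraction of pairs of voltage vectors that the estimator ranks the same way as the simulator. A minimal sketch of that calculation is given below; the function and argument names are hypothetical, and the throughput values in the example are made up purely to exercise the code.

```python
from itertools import combinations
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fidelity(simulated: Sequence[float], estimated: Sequence[float]) -> float:
    """Fraction of pairs (i, j), i < j, whose ordering under the estimates
    matches their ordering under simulation (equation 4.2)."""
    n = len(simulated)
    if n != len(estimated) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 entries")
    agree = sum(
        1
        for i, j in combinations(range(n), 2)
        if sign(simulated[i] - simulated[j]) == sign(estimated[i] - estimated[j])
    )
    return 2.0 * agree / (n * (n - 1))


# Made-up values: the estimator swaps the ranking of the last two points.
example = fidelity([1.0, 2.0, 3.0], [10.0, 30.0, 20.0])
```

Note that fidelity depends only on the relative ordering of the estimates, not on their absolute accuracy, which is why a separate error measure is also examined.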


Figure 4.4 plots Fidelity for a six-processor system in which the voltage on the
individual processors can vary between plus and minus five percent. The x-axis represents
the sum of the absolute values of the voltage changes over all processors. Each point
on the graph is a fidelity calculation for $N = 100$ voltage vectors. A value of one
is a “perfect” fidelity. It can be seen that in the range shown, the fidelity is always
greater than 0.77. It is also important that the estimator have a small error at each point.
Figure 4.5 plots

$$ \sum_{i=1}^{N} \left[ (S_i - M_i) / S_i \right] $$    (4.3)

for a six processor system. It can be seen that the error increases as the voltage vector
moves away from the reference point, and that the estimate is slightly biased. For the
range shown in the graphs, where each processor voltage is changed by a maximum of
fifteen percent, the error is less than four percent.

[Plot: Fidelity - 6 processors each changing by at most 15%]

Figure 4.4: Plot of fidelity (equation 4.2) for a six processor system vs. magnitude of
voltage change on all processors.

[Plot: 6 processors each changing by at most 15%]

Figure 4.5: Plot of average error (equation 4.3) of fidelity estimate for a six processor
system vs. voltage change on processors.
4.4 Using the Period Graph in a Joint Power/Performance Algorithm

An effective way to reduce power consumption of a processor core in CMOS technology
is to lower the supply voltage level, which exploits the quadratic dependence of power
on voltage [21]. Reducing the supply voltage also has the effect of decreasing the clock
speed and increasing circuit delay. The circuit delay can be modeled by (4.1). The power
consumption is given by

$$ P = \alpha C_L V_{dd}^2 f $$    (4.4)

where $f$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching
activity [121]. To accommodate the possibility of putting processors in states of lower
switching activity during idle periods, our model includes a parameter $\alpha_{idle}$ for the idle
states, and a parameter $\alpha_{non-idle}$ for the computational tasks, where $\alpha_{idle} < \alpha_{non-idle}$.
A more detailed power analysis could assign a different $\alpha$ for each computational task
if that data were available. A different power optimization technique, which can be
used in conjunction with the voltage scaling technique presented here, utilizes a nearly
complete processor shutdown during the idle periods 052, 110.11]. In our model, this would
correspond to $\alpha_{idle} = 0$. Our model for the power is the average energy consumption per
graph iteration period. This corresponds in a typical DSP system to the average energy
required to process one sample. Here, the energy of each node equals its power times
its execution time.
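
The energy model can be illustrated with a short sketch for a single processor. Several points here are assumptions rather than statements from the text: the delay model is taken to be the commonly used CMOS form $t = kV/(V - V_t)^2$, the clock frequency is assumed to scale inversely with that delay, idle durations are taken as given rather than re-derived, and all names (`delay_factor`, `Node`, `energy_per_period`) and numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

V_T = 0.8  # threshold voltage in volts, the value used in the text


def delay_factor(v: float, v_ref: float, v_t: float = V_T) -> float:
    """Ratio t(v) / t(v_ref) under the assumed model t = k * V / (V - V_t)^2."""
    if v <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v / (v - v_t) ** 2) / (v_ref / (v_ref - v_t) ** 2)


@dataclass
class Node:
    ref_time: float  # duration at the reference voltage
    is_idle: bool


def energy_per_period(nodes: Iterable[Node], v: float, v_ref: float,
                      f_ref: float, c_load: float,
                      alpha_idle: float, alpha_non_idle: float) -> float:
    """Sum of power * duration over the nodes of one processor's period."""
    scale = delay_factor(v, v_ref)
    f = f_ref / scale  # assumption: clock frequency tracks circuit delay
    total = 0.0
    for node in nodes:
        alpha = alpha_idle if node.is_idle else alpha_non_idle
        power = alpha * c_load * v ** 2 * f  # equation (4.4)
        # Computation stretches with the delay; idle time is kept as given.
        duration = node.ref_time if node.is_idle else node.ref_time * scale
        total += power * duration
    return total


period = [Node(2.0, False), Node(1.0, True)]
e_ref = energy_per_period(period, 5.0, 5.0, 1.0, 1.0, 0.1, 1.0)
```

Under these assumptions the energy of a computation node is proportional to $V^2$ alone, because the lower frequency and the longer execution time cancel, which is the quadratic saving that voltage scaling exploits.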

In a system consisting of multiple processors, one has the ability to choose, within
a certain range, the (fixed) operating voltage on each processor. This opens up an
additional degree of freedom that can be exploited to minimize the system power
consumption. By choosing a lower voltage for a processor that is executing tasks that are not
on the critical path, the throughput can remain unchanged while the overall power
consumption is reduced. In general, a combination of raising voltages on some processors
while lowering others can yield the most attractive power/performance solution.

When  applying  voltage  scaling  to  a  multiprocessor  system,  the  valid  solution  space 
is  typically  much  too  large  to  search  by  brute-force  methods.  In  addition,  since  there 
is  no  general  analytical  formula  for  calculating  the  throughput  of  these  systems  in  the 
presence  of  communication  resource  contention,  each  candidate  solution  must  either  be 
simulated  or  estimated  using  some  heuristic. 

4.5  Genetic  Algorithm  Formulation 

To demonstrate the general utility of the period graph based performance estimation
approach, we incorporated it into two significantly different probabilistic search
techniques to derive two different algorithms for systematic voltage scaling [EH-  The first
algorithm presented utilizes the framework of genetic algorithms (GAs) [E3Q. We will
discuss GAs more in Chapter |Sj. The specific GA explored here consists of an inner GA
nested within an outer GA. The inner GA performs a local search around a point from
the population of the outer GA, using the MCM of the period graph in its objective
function as an estimate for the throughput. A period constraint $T_{constraint}$ is given as an input
to the optimization problem, where the period is the reciprocal of the throughput. The
objective function calculates the power consumption associated with each solution by
calculating the total energy per period, as discussed earlier. If the period associated with
a solution violates the period constraint ($T_{solution} > T_{constraint}$), the power consumption is
multiplied by a large penalty factor $e^{100(T_{solution} - T_{constraint})}$. The GA attempts to minimize
this objective function.
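
A minimal sketch of such a penalized objective is shown below. The function name and the `penalty_rate` parameter are hypothetical; the rate that multiplies the constraint violation in the exponent is left to the caller rather than fixed here.

```python
import math


def penalized_power(power: float, t_solution: float, t_constraint: float,
                    penalty_rate: float) -> float:
    """Return the power unchanged when the period constraint is met, and
    multiplied by an exponential penalty when it is violated."""
    if t_solution > t_constraint:
        return power * math.exp(penalty_rate * (t_solution - t_constraint))
    return power


# A solution that misses the constraint by 0.1 time units is penalized,
# so a minimizer prefers a feasible solution with somewhat higher power.
feasible = penalized_power(2.5, 1.0, 1.0, penalty_rate=10.0)
infeasible = penalized_power(2.0, 1.1, 1.0, penalty_rate=10.0)
```

Because the penalty grows with the size of the violation, slightly infeasible solutions remain comparable to feasible ones, which lets the search cross the constraint boundary instead of being blocked by it.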
In the outer loop, a population of $N_{outer}$ voltage vectors is generated. A simulation
is run and a period graph constructed for each of these outer loop voltage vectors. For
each of the outer loop voltage vectors, a new inner loop population is generated such
that $|V_{outer_i} - V_{inner_i}| < \epsilon$ for $i = 1, \ldots, N_{proc}$, where $N_{proc}$ is the number of processors, $V_{outer_i}$
is the voltage on processor $i$ in the outer population, $V_{inner_i}$ is the voltage on processor
$i$ in the inner population, and $\epsilon$ is a user-defined threshold. The inner population size is
$N_{inner}$. The inner GA then performs a local search using this population for a number
of generations $Generations_{inner}$ in an attempt to find a locally optimal voltage vector.
The inner GA uses the MCM of the period graph in its objective function. After an
invocation of the inner GA is finished, one simulation is performed using the resulting
voltage vector, and the actual throughput for this point is used to compute its fitness. The
outer loop voltage vector is then replaced with this locally-optimized voltage vector for
use in the next outer loop generation. The outer loop is run for a number of generations
$Generations_{outer}$.


4.6  Simulated  Annealing  Algorithm 

Simulated  annealing  is  another  well-known  method  for  searching  large  design  spaces. 
Using  a  standard  simulated  annealing  package  0230,  we  have  implemented  an  alternative 
version of period-graph-based voltage scaling optimization. The objective function here
is the same as for the genetic algorithm. The system is first simulated with an initial
voltage vector $V_i = LSV_i$, and the period graph is built. In order to ensure that the
period graph will be a good enough estimator, a re-simulation threshold $T$ is maintained.
The difference between the current input $CV_i$ to the objective function, and the voltage
vector $LSV_i$ corresponding to the simulation used to compute the current period, is
calculated. If

$$ \frac{1}{N} \sum_{i=1}^{N} \frac{|CV_i - LSV_i|}{LSV_i} > T $$

the graph is re-simulated using $CV_i$. The period graph is rebuilt, and $CV_i \rightarrow LSV_i$.
For $T = 0$, the graph will be re-simulated every time, and the period graph will offer no
speedup to the optimization. The larger the value of $T$, the less often the graph will be
re-simulated, and the faster the optimization algorithm will perform. However, when $T$
is too large, the fidelity of the period graph estimate will be unacceptably low and the
quality of the final result will suffer. Based on our experiments with a number of graphs,
the optimal value of $T$ is application-dependent, but a value of $T = 0.1$ generally gives
good results.
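
The re-simulation test amounts to comparing the mean relative deviation between two voltage vectors against the threshold. A minimal sketch follows, with hypothetical names and made-up voltages; the strict inequality means an unchanged vector never triggers a re-simulation, even at $T = 0$.

```python
from typing import Sequence


def needs_resimulation(current: Sequence[float],
                       last_simulated: Sequence[float],
                       threshold: float) -> bool:
    """True when the mean relative deviation of the current voltage vector
    from the last simulated voltage vector exceeds the threshold T."""
    if len(current) != len(last_simulated) or not current:
        raise ValueError("voltage vectors must be non-empty and equally long")
    mean_dev = sum(abs(cv - lsv) / lsv
                   for cv, lsv in zip(current, last_simulated)) / len(current)
    return mean_dev > threshold


# Two processors, last simulated at 5 V each, threshold T = 0.1.
small_move = needs_resimulation([5.5, 5.0], [5.0, 5.0], 0.1)  # mean deviation 0.05
large_move = needs_resimulation([6.0, 6.0], [5.0, 5.0], 0.1)  # mean deviation 0.20
```

When the check returns true, the caller would re-simulate, rebuild the period graph, and record the current vector as the new last simulated vector.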


4.7  Results  of  Voltage  Scaling  using  Period  Graph 

Figure 4.6(a) shows an example of the reduction in power resulting from the genetic
optimization algorithm on the FFT2 application graph. The parameters of the GA were
$N_{outer} = N_{inner} = 50$, $Generations_{outer} = 10$, and $Generations_{inner} = 20$. The local
search voltages were constrained to be within five percent of the corresponding outer
loop voltages. The period constraint was calculated by simulating the system with all
six processors operating at voltage $V_{ref}$. For this example, the system power
consumption was reduced by 43%, while maintaining the original throughput. To evaluate the
advantage of the period graph approach over using brute-force simulation, a second
nested GA was implemented. This algorithm was identical to the algorithm discussed
above, except that the inner loop did not use the period graph estimate for the
throughput. Instead, each voltage vector was evaluated by simulation. This algorithm consumed
21 times more CPU time, and produced similar results, as shown in Figure 4.6(b).
[Plot: Genetic algo. fft3 (fixed throughput constraint) using period graph]
(a) Using period graph. Each iteration requires 6 minutes CPU time.

[Plot: Genetic algo. fft3 (fixed throughput constraint) using simulation only]
(b) Using simulation only. Each iteration requires 126 minutes CPU time.

Figure 4.6: Plot of (optimized power)/(initial power) vs. genetic algorithm iteration
using the period graph estimator (a) and simulation only (b). Using simulation only, the
iterations require 21 times more CPU time.
[Plot: simulated annealing FFT3]

Figure 4.7: Plot of (optimized power)/(initial power) vs. time for the simulated
annealing algorithm combined with period graph on the FFT3 application.

Figure 4.7 summarizes the power reduction results for the simulated annealing
algorithm applied to a fast Fourier transform (FFT3) application graph, for different values
of the re-simulation threshold $T$. It can be seen that as $T$ is increased, the algorithm
progresses more quickly. The simulated annealing algorithm begins with a ‘melting’
routine, where the temperature is increased until a phase change is detected. The initial
flat part of the curves corresponds to the time spent in the melting routine. We have
found that for values of $T$ above 20%, the period graph is not a good enough estimator
and the algorithm does not converge.

Table 4.1 summarizes the power reduction for the simulated annealing algorithm for
several additional applications using different values of the re-simulation threshold. At
the start of the optimization, all processor voltages were set at 5 volts. The throughput at
this point was used as the throughput constraint. In Table 4.1, the first three rows
correspond to three different FFT implementations; mus refers to a music synthesis algorithm,
qmf refers to a quadrature mirror filter bank, and meas is a measurement application.

application    re-simulation threshold
               0       2%      5%      10%     25%
FFT1(28)       0.96    0.95    0.65    0.60    1
FFT2(28)       0.97    0.90    0.71    0.97    1
FFT3(28)       1       0.77    0.59    0.59    1
mus(20)        0.89    0.71    0.67    0.82    1
meas(12)       0.77    0.73    0.81    0.82    1
qmf(14)        0.84    0.65    0.67    0.73    1
rand1(30)      0.91    0.77    0.53    0.65    1
rand2(100)     1       0.85    0.77    0.73    1
rand3(200)     1       1       1       0.94    1

Table 4.1: Ratio of optimized power to initial power for a fixed computation time using
period graph and simulated annealing.

We implemented a random application graph generator based on Sih’s algorithm [ESQ.
The last three rows of Table 4.1 correspond to three random graphs generated with this
algorithm. The numbers in parentheses give the numbers of nodes in these applications.
The optimization was performed for a fixed time of 30 minutes in each case. The
optimum re-simulation threshold was between 2% and 10% in all cases. For $T = 0.25$,
the period graph is not a good estimator and none of the results returned during the
optimization algorithm satisfied the throughput constraint. For the largest graph, the fixed
simulation time was not long enough to make much improvement, but the best result
occurred for $T = 0.1$, where the simulations are less frequent.

Table 4.2 summarizes the power reduction for the genetic algorithm with and without
using the period graph, with a fixed compile time (run time) of one hour. It can be seen
that, under the condition of fixed compile time, we achieve better results (lower power)
when utilizing the period graph. Also, comparing Table 4.1 with Table 4.2, we see
that the longer compile time given to the GA produced better results. We will explore
the issue of search efficiency under fixed optimization times in a systematic manner in
Chapter §.
application    using period graph    no period graph
fft1           0.54                  0.74
fft2           0.69                  0.86
fft3           0.57                  0.78
mus            0.68                  0.90
meas           0.70                  0.82
qmf            0.64                  0.84
rand1(30)      0.55                  0.78
rand2(100)     0.70                  1
rand3(200)     0.87                  1

Table 4.2: Ratio of (optimized power)/(initial power) for genetic algorithm with fixed
run time.

4.8  Summary  of  Period  Graph  Work 

We have developed a period graph model that can be used as a computationally
efficient estimator for the throughput in multiprocessor systems in which communication
contention renders exact analysis too time-consuming. This model is especially useful in
interactive synthesis techniques, such as those based on probabilistic search. We
demonstrated effective voltage scaling techniques based on incorporating the period graph into
genetic algorithm and simulated annealing formulations.
Chapter 5

Contention Analysis in Optically Connected Systems


We introduced the IPC graph model in Section 2.4 and showed that for systems without
contention, the maximum cycle mean (MCM) of the IPC graph can be used as an
efficient estimator for the system throughput. We showed in Chapter 4 that for a shared-bus
system the analysis is complicated significantly by contention for the bus among the
processors. Shared bus systems are appealing due to their simplicity and low cost. This
is the primary driver for many embedded systems applications. In Section 5.1 we will
discuss Sriram’s ordered transactions strategy [102], and show that by adding a
hardware controller to a shared bus system, it is possible to remove the
contention that results in the difficult analysis, and to more fully optimize communication
patterns. With the hardware controller the processors still share a communication
channel, namely the bus, but the contention is resolved by the controller. For systems that
require the performance, the cost of the additional hardware may be justified.

For systems with significant interprocessor communication activity and high
performance requirements, it may be the case that an electronic interconnect between the
processors is not appropriate. In Section |3.S.2| we introduced two system architectures
based on fiber interconnects. In the first architecture, there is a dedicated
communication channel between every pair of processors, so there is no sharing of communication
resources. In the second architecture, processors share their input ports; the amount of
sharing is reduced as compared to an electronic bus because different wavelengths can
be transmitted simultaneously in the fiber. In Section 5.2 we will modify Sriram’s model
to work with this architecture.


5.1  Ordered  Transactions 

5.1.1  Ordered  Transactions  Concept 

The  ordered  transactions  strategy  for  multiprocessor  shared  bus  systems  consists  of  two 
parts: 

1.  Determine  at  compile  time  the  order  in  which  processor  communications  occur. 

2.  Enforce  that  order  at  run  time  with  a  hardware  controller. 

As in the self-timed approach, a static schedule is first computed using execution time
estimates for the actors, but only the actor ordering on each processor is retained; the
actor start times are discarded. The hardware controller grants access to the processors
in the predetermined order. When a processor is granted access to the bus, it performs
its read or write operation and releases the bus back to the hardware controller. Since
the hardware controller enforces the communication order, there is no contention for
the bus, and no bus arbitration is necessary at the individual processors. The transaction
order preserves the data precedences in the algorithm, and therefore for a shared memory
system no semaphore synchronization is necessary. Also, send and receive operations
always access the shared bus for one memory cycle; there is no polling required. This
reduces the number of shared memory accesses by at least a factor of two.
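
The effect of enforcing a compile-time order can be illustrated with a small sketch of the grant sequence. This is a simplification under stated assumptions: the function `grant_bus` and its inputs are hypothetical, each transaction occupies the bus for one fixed cycle, and the time at which each processor becomes ready is supplied as a given rather than derived from earlier grants.

```python
from typing import List, Tuple


def grant_bus(transactions: List[Tuple[str, float]],
              cycle: float = 1.0) -> List[Tuple[str, float, float]]:
    """Grant the bus strictly in list order. Each entry is (processor,
    time at which it is ready to communicate); the result lists
    (processor, start, end) for every transaction."""
    schedule = []
    bus_free_at = 0.0
    for processor, ready in transactions:
        # The controller waits for the next processor in the order, even if
        # a later one is already ready, so no arbitration is ever needed.
        start = max(ready, bus_free_at)
        end = start + cycle
        schedule.append((processor, start, end))
        bus_free_at = end
    return schedule


# P2 is ready at time 0 but must wait, because the order places P1 first.
schedule = grant_bus([("P1", 3.0), ("P2", 0.0)])
```

The example shows the price of the strategy: a processor that is ready early can be held back by the predetermined order, in exchange for a bus that is free of contention.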

5.1.2  Synchronization  Constraints 

We introduced the interprocessor communication graph (IPC graph) $G_{ipc}$ and the
synchronization graph $G_s$ in Section 2.4. Initially, $G_s$ is identical to $G_{ipc}$. However, various
transformations can be applied to $G_s$ in order to make the overall synchronization
structure more efficient [101]. After all transformations on $G_s$ are complete, $G_s$ and $G_{ipc}$ can
be used to map the given parallel schedule into an implementation on the target
architecture. The IPC edges in $G_{ipc}$ represent buffer activity, and are implemented as buffers
in shared memory, whereas the synchronization edges of $G_s$ represent synchronization
constraints, and are implemented by updating and testing flags in shared memory. If
there is an IPC edge as well as a synchronization edge between the same pair of tasks,
then a synchronization protocol is executed before the buffer corresponding to the IPC
edge is accessed to ensure sender-receiver synchronization. On the other hand, if there
is an IPC edge between two tasks in $G_{ipc}$, but there is no synchronization edge
between the two, then no synchronization needs to be done before accessing the shared
buffer. If there is a synchronization edge between two tasks but no IPC edge, then no
shared buffer is allocated between the two tasks; only the corresponding synchronization
protocol is invoked.

Any transformation we perform on the synchronization graph must respect the
synchronization constraints implied by $G_{ipc}$. If we ensure this, then we only need to
implement the synchronization edges of the optimized synchronization graph (in conjunction
with the IPC edges of $G_{ipc}$). If $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ are synchronization
graphs with the same vertex-set and the same set of intraprocessor edges (edges that
are not synchronization edges), we say that $G_1$ preserves $G_2$ if for all $e \in E_2$ such that
$e \notin E_1$, we have

$$ \rho_{G_1}(src(e), snk(e)) \le delay(e), $$

where $\rho_G(x, y) = \infty$ if there is no path from $x$ to $y$ in the synchronization graph $G$, and
if there is a path from $x$ to $y$, then $\rho_G(x, y)$ is the minimum over all paths $p$ directed from
$x$ to $y$ of the sum of the edge delays on $p$. Thus, $G_1$ preserves $G_2$ if for any new edge
in $G_2$ (i.e., for any edge not in $G_1$), there is a path in $G_1$ directed from the source of the
edge to the sink that has a cumulative delay that is less than or equal to the delay of the
edge. The following theorem (developed in [101]) is fundamental to synchronization
graph analysis.

Theorem 1 The synchronization constraints in a synchronization graph
$G_s^1 = (V, E_{int} \cup E_1)$ imply the synchronization constraints of the synchronization
graph $G_s^2 = (V, E_{int} \cup E_2)$ if the following condition holds: $\forall e$ s.t. $e \in E_2$, $e \notin E_1$,
$\rho_{G_s^1}(src(e), snk(e)) \le delay(e)$; that is, if for each edge $e$ that is present in $G_s^2$ but not in
$G_s^1$, there is a minimum delay path from $src(e)$ to $snk(e)$ in $G_s^1$ that has total delay of at
most $delay(e)$.

Theorem 1 is the basis for a variety of useful synchronization graph
transformations. One such transformation is the detection and removal of redundant
synchronization edges, which are synchronization edges whose respective synchronization
functions are subsumed by other synchronization edges, and thus need not be implemented
explicitly. Another transformation, called resynchronization, involves inserting
synchronization edges in a way that the number of original synchronization edges that become
redundant exceeds the number of new edges added.
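
The condition of Theorem 1 can be checked mechanically with a shortest-path computation over edge delays. The sketch below is an illustration only: the edge representation and the names `min_delay` and `preserves` are hypothetical, delays are assumed to be non-negative integers (so Dijkstra's algorithm applies), and the source and sink of each tested edge are assumed to be distinct.

```python
import heapq
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source, sink, delay)


def min_delay(edges: List[Edge], src: str, snk: str) -> float:
    """Minimum total delay over all paths from src to snk, or infinity
    if no such path exists."""
    adj: Dict[str, List[Tuple[str, int]]] = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, math.inf):
            continue  # stale heap entry
        if u == snk:
            return d_u
        for v, d in adj.get(u, []):
            if d_u + d < dist.get(v, math.inf):
                dist[v] = d_u + d
                heapq.heappush(heap, (d_u + d, v))
    return math.inf


def preserves(g1: List[Edge], g2: List[Edge]) -> bool:
    """True if every edge of g2 that is missing from g1 is covered by a
    path in g1 whose total delay does not exceed the delay of that edge."""
    present = set(g1)
    return all(min_delay(g1, u, v) <= d
               for (u, v, d) in g2 if (u, v, d) not in present)


# Hypothetical example: tasks A, B on one processor and C, D on another.
intra = [("A", "B", 0), ("C", "D", 0)]
full = intra + [("B", "C", 0), ("A", "D", 0)]
without_ad = intra + [("B", "C", 0)]
ad_is_redundant = preserves(without_ad, full)
```

Here the synchronization edge from A to D is redundant because the path A, B, C, D already enforces it with zero delay, so the graph without that edge preserves the full graph.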
5.1.3 Ordered Transactions Graph

Sriram’s ordered transaction graph [102] is a useful data structure for analyzing ordered
transaction implementations. Given an ordering $O = \{o_1, o_2, \ldots, o_p\}$ for the communication
actors in an IPC graph $G_{ipc} = (V_{ipc}, E_{ipc})$, the corresponding ordered transaction
graph $\Gamma(G_{ipc}, O)$ is defined as the directed graph $G_{OT} = (V_{OT}, E_{OT})$, where

$$ V_{OT} = V_{ipc} $$
$$ E_{OT} = E_{ipc} \cup E_O $$
$$ E_O = \{(o_p, o_1), (o_1, o_2), (o_2, o_3), \ldots, (o_{p-1}, o_p)\} $$
$$ delay(o_i, o_{i+1}) = 0 \text{ for } 1 \le i < p $$
$$ delay(o_p, o_1) = 1 $$

Thus, an IPC graph can be modified by adding edges (the edges of $E_O$) obtained from
the ordering $O$ to create the ordered transactions graph.
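
The definition above translates directly into code. The following sketch uses a hypothetical edge representation of (source, sink, delay) tuples and hypothetical function names; it builds the ordering edges and merges them with the IPC edges without introducing duplicates.

```python
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source, sink, delay)


def ordering_edges(order: List[str]) -> List[Edge]:
    """Chain consecutive communication actors with delay-0 edges and close
    the chain from the last actor back to the first with a delay-1 edge."""
    if not order:
        return []
    edges = [(order[i], order[i + 1], 0) for i in range(len(order) - 1)]
    edges.append((order[-1], order[0], 1))
    return edges


def ordered_transaction_graph(ipc_edges: List[Edge],
                              order: List[str]) -> List[Edge]:
    """Union of the IPC edges and the ordering edges."""
    result = list(ipc_edges)
    for edge in ordering_edges(order):
        if edge not in result:
            result.append(edge)
    return result


# Hypothetical ordering of three communication actors.
e_o = ordering_edges(["s1", "r1", "s2"])
```

The single delay-1 edge from the last actor to the first reflects that the transaction order repeats once per schedule iteration.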

A closely related data structure is the transaction partial order graph $G_{tpo}$. The
transaction partial order graph represents the minimum set of dependencies imposed
among different processors by the communication actors of the IPC graph. These
dependencies must be obeyed by any ordering of the communication operations. Under the
assumption that the send and receive actors are serially ordered on each processor, $G_{tpo}$
can be computed from $G_{ipc}$ by first deleting all edges in $G_{ipc}$ that have delays of one
or more, and then deleting all of the computation actors [E2Q. However, since we wish
to allow for the possibility of data transmission on multiple channels simultaneously, we
do not make this assumption, and we must modify the algorithm [EQ. Figure 5.1 gives
pseudo-code for our modified algorithm for generating $G_{tpo}$. The key difference with
our algorithm is that $G_{ipc}$ may now contain computation nodes with multiple
predecessors and multiple successors. These nodes cannot be removed since this would require
Algorithm 5.1: GENERATE TPO GRAPH(G_ipc)

    input: IPC graph G_ipc
    output: transaction partial order (TPO) graph

    finished ← FALSE
    for (∀ edges e ∈ G_ipc)
        do if (e is a feedback edge)
               then delete e
    while (finished = FALSE)
        do finished ← TRUE
           for (∀ nodes v ∈ G_ipc)
               do if (v is a computation node)
                      then if ((indeg(v) = 0) OR (outdeg(v) = 0))
                               then delete v
                                    finished ← FALSE
                           else if (indeg(v) = 1)
                               then p ← predecessor node of v
                                    for (∀ successors s of v)
                                        do create edge (p, s)
                                    delete v
                                    finished ← FALSE
                           else if (outdeg(v) = 1)
                               then s ← successor node of v
                                    for (∀ predecessors p of v)
                                        do create edge (p, s)
                                    delete v
                                    finished ← FALSE

Figure  5.1:  Pseudo-code  for  generating  the  TPO  graph  from  the  IPC  graph. 
[Figure 5.2 panels: (a) IPC graph; (b) $G_{tpo}$ after one pass; (c) $G_{tpo}$ after two passes.]

Figure 5.2: The TPO graph $G_{tpo}$ derived from the IPC graph in (a) after one pass (b) and
two passes (c) of the algorithm given in Figure 5.1.

imposing some additional dependencies on these nodes, and we want $G_{tpo}$ to represent
the minimal set of dependencies. In each pass of the algorithm, the graph is reduced
by removing as many computation actors as possible. The algorithm terminates when no more
computation actors can be removed. We will see in Section 5.2.2 that there are advantages
to operating on the reduced TPO graph, since the search space of possible transaction
orders can be exponential in the size of the graph.
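The reduction described above can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation. It assumes that a feedback edge is any edge with a delay of one or more, that the graph is acyclic once those edges are removed, and that parallel edges may be merged. The names `generate_tpo_graph`, `nodes`, `edges`, and `computation` are hypothetical.

```python
def generate_tpo_graph(nodes, edges, computation):
    """Reduce an IPC graph to a TPO graph.

    nodes: iterable of sortable node names
    edges: iterable of (source, target, delay) triples
    computation: set of nodes that are computation actors
    Returns (remaining_nodes, remaining_edges) with edges as (source, target) pairs.
    """
    live = set(nodes)
    # Assumed meaning of "feedback edge": any edge carrying one or more delays.
    arcs = {(u, v) for (u, v, d) in edges if d == 0}
    finished = False
    while not finished:
        finished = True
        for v in sorted(live):  # sorted() copies, so removing from live is safe
            if v not in computation:
                continue
            preds = {u for (u, w) in arcs if w == v}
            succs = {w for (u, w) in arcs if u == v}
            if not preds or not succs:
                bypass = set()  # source or sink node: simply drop it
            elif len(preds) == 1 or len(succs) == 1:
                # Reconnect around v without adding any new dependency.
                bypass = {(p, s) for p in preds for s in succs}
            else:
                continue  # several predecessors and successors: keep v
            arcs = {(u, w) for (u, w) in arcs if v not in (u, w)} | bypass
            live.discard(v)
            finished = False
    return live, arcs
```

A computation node with several predecessors and several successors is left in place, matching the observation that removing it would impose additional dependencies.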

Figure 5.2 shows an example of how the transaction partial order graph is derived.
The IPC graph is shown in Figure 5.2(a). After one pass of the algorithm given in Figure 5.1, the
TPO graph contains two computation actors (Figure 5.2(b)). The algorithm terminates
after two passes with one computation actor remaining (Figure 5.2(c)).

As described earlier, when the ordered transaction strategy is implemented using
a hardware method such as a micro-controller that imposes the linear order, there is no
need for synchronization, and contention for shared communication resources is also
eliminated. Therefore, if the execution time estimates for the actors are accurate or are
true worst-case values, then the maximum cycle mean (MCM) of the ordered transaction
graph gives an accurate estimate or worst-case bound, respectively, of the iteration
period of the associated application graph under the ordered transaction strategy.
Such efficient, accurate performance assessment is useful for design space exploration
in general, and it is especially useful when implementing applications that have real-time
constraints.
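To make the MCM concrete, the following Python sketch computes it by brute force for small graphs. It assumes the usual definition, which is not restated in this section: the cycle mean is the sum of actor execution times around a cycle divided by the sum of edge delays on that cycle, and the MCM is the maximum over all simple cycles. The function name and the `(source, target, delay)` edge triples are hypothetical, execution times are assumed to be integers, and the exhaustive cycle enumeration is for illustration only (it is exponential in general).

```python
from fractions import Fraction


def maximum_cycle_mean(exec_time, edges):
    """exec_time: {actor: integer time}; edges: list of (source, target, delay).

    Returns the MCM as a Fraction, or None if the graph has no cycle.
    """
    succ = {}
    for u, v, d in edges:
        succ.setdefault(u, []).append((v, d))
    order = sorted(exec_time)
    rank = {n: i for i, n in enumerate(order)}
    best = None

    def dfs(start, node, time_sum, delay_sum, on_path):
        nonlocal best
        for nxt, d in succ.get(node, []):
            if nxt == start:
                total_delay = delay_sum + d
                if total_delay == 0:
                    raise ValueError("zero-delay cycle: the graph deadlocks")
                mean = Fraction(time_sum, total_delay)
                if best is None or mean > best:
                    best = mean
            elif rank[nxt] > rank[start] and nxt not in on_path:
                # Only visit higher-ranked nodes so each cycle is found once.
                dfs(start, nxt, time_sum + exec_time[nxt],
                    delay_sum + d, on_path | {nxt})

    for s in order:
        dfs(s, s, exec_time[s], 0, {s})
    return best


if __name__ == "__main__":
    times = {"A": 2, "B": 3, "C": 1}
    base = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)]
    print(maximum_cycle_mean(times, base))
    # Adding an edge can only add cycles, so the MCM cannot decrease.
    print(maximum_cycle_mean(times, base + [("C", "A", 1)]))
```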

If interprocessor communication costs are negligible, an optimal transaction order
can be computed in low polynomial time for a given self-timed schedule [111121]. This
method of deriving transaction orders is called the Bellman-Ford Based (BFB) method
since it is based on applying the Bellman-Ford shortest path algorithm to an intermediate
graph that is derived from the given self-timed schedule.

However, when IPC costs are not negligible, as is frequently and increasingly the
case in practice, the problem of determining an optimal transaction order is NP-hard 062!].
This intractability has been shown to hold both under iterative and non-iterative
execution of application graphs. Thus, under nonzero IPC costs, we must resort to heuristics
for efficient solutions. Furthermore, the polynomial-time BFB algorithm is no longer
optimal, and alternative techniques to account for IPC costs are preferable.

In the presence of non-negligible communication costs, an efficient transaction order
can be constructed with the help of the transaction partial order graph $G_{tpo}$ described
earlier. The transaction partial order algorithm is one systematic approach for using
transaction partial order graphs to construct efficient orderings of communication
operations. This algorithm proceeds by considering, one by one, each vertex of $G_{tpo}$ that
has no input edges (vertices in the transaction partial order graph that have no input
edges are called ready vertices) as a candidate to be scheduled next in the transaction
order. Interprocessor edges are inserted from each candidate vertex to all other ready
vertices in $G_{ipc}$, and the corresponding MCM is measured. The candidate whose
corresponding MCM is the least when evaluated in this fashion is chosen as the next vertex
in the ordered transaction, and is deleted from $G_{tpo}$. This process is repeated until all
communication actors have been scheduled into a linear ordering.

Khandelia [E20 shows that the transaction partial order heuristic can improve the
performance beyond what is achievable by a self-timed schedule, even if synchronization
and arbitration costs are negligible compared to actor execution times. The performance
benefit is achieved by strategic positioning of the communication operations in ways
that do not result from the natural evolution of self-timed schedules.


5.2  WDM  Ordered  Transactions 

In some applications, a shared electronic bus cannot handle the required communication
traffic, even if this traffic is carefully optimized by using the transaction partial order
heuristic. Moving to a faster shared medium such as optical Ethernet may be a solution in
some cases, since the interprocessor communication (IPC) is faster. However, we often
cannot derive suitable solutions for highly parallel applications scheduled on multiple
processors. In this case the shared nature of the interconnect becomes a bottleneck
for the large amount of IPC required to effectively use all the processors. For these
applications, alternative interconnection topologies are required.

One such alternative is to use multiple busses. Lee and Bier [KS71J describe how the
ordered transaction strategy can be extended to utilize a hierarchy of multiple busses.
In Section |3.h.2| we introduced the WDMOT architecture utilizing fiber optics, which
is not a fully-connected topology (every processor pair having a dedicated channel), but
can support multiple simultaneous communications on different wavelengths. The
advantage of this architecture over the fully-connected topology is that the number of
wavelengths required scales with $N$ instead of $N^2$, where $N$ is the number of
processors. In this section we will discuss a heuristic for developing good processor orderings
using this architecture, and compare the resulting system throughput with the
throughput obtained using the transaction partial order algorithm for an electrical bus and with
the throughput obtained in a fully connected system.

In the WDMOT architecture, we implement a protocol in which each processor is
assigned a unique wavelength. We must ensure that two processors do not send to a given
processor at the same time; that is, there is possible contention at the (single) receiver
of every processor. In order to accomplish this, we introduce a controller for every
wavelength. This controller grants access to only one processor at a time. Figure 5.3
(repeated from Figure 3.10) depicts the architecture. Three fibers are shown: one to
carry the data, one for the wavelength grant signals, and one for the wavelength release
signals. The grant and release signals indicate that the wavelength is available
(wavelength grant) or that a processor is finished using the wavelength (wavelength release or
ack). The signal for wavelength $\lambda_k$ being granted to processor $p$ consists of an ID tag
for processor $p$ transmitted on $\lambda_k$. A portion of the grant signal is split off the grant fiber
and distributed to each processor, where it is separated by wavelength. Processor $p$ may
transmit on $\lambda_k$ if it receives its own ID tag on receiver $k$. This is shown in more detail in
Figure 5.4. The controller for wavelength $\lambda_j$ is responsible for ordering all
communications to processor $p_j$ (which receives data on $\lambda_j$) so that only one processor is attempting
to transmit on $\lambda_j$ at a given time. In order to accomplish this, the wavelength controller
Figure 5.3: Architecture for wavelength ordered transactions (repeated from
Figure 3.10).

Figure 5.4: One wavelength controller and one processor in the WDMOT architecture.
The boxes marked c are wavelength combiners. The lower part of the figure shows how
multiple discrete sources can be combined to form the tunable source.

[Figure 5.5 labels: controller for wavelength $k$; Data $p_{k2}$: processor $p_{k2}$ sends data on wavelength $k$.]

Figure 5.5: Controller state diagram and synchronization signals used in the WDMOT
architecture.

must have a transmitter at the single wavelength $\lambda_j$, as well as a receiver at $\lambda_j$. Each
processor $p_j$ has a single receiver at $\lambda_j$, and a transmitter for each processor to which it
will send data. This number may be limited by a fan-out constraint. Figure 5.5 shows
the state diagram for a controller for wavelength $\lambda_k$. In Figure 5.5 there are $m$
processors scheduled to transmit on $\lambda_k$, and they are ordered $[p_{k1}, p_{k2}, \ldots, p_{km}]$. Since the
order of grants to $\lambda_k$ is enforced by the controller, only one processor transmits on $\lambda_k$ at
a given time.
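The grant and release handshake can be illustrated with a small Python model of one wavelength controller. This is a sketch under assumptions, not a description of the hardware: the class and method names are hypothetical, and it assumes that the order $[p_{k1}, \ldots, p_{km}]$ repeats cyclically from one graph iteration to the next.

```python
class WavelengthController:
    """Toy model of the controller for one wavelength.

    Grants are issued strictly in the precomputed order, and a new grant is
    only issued after the current holder has released the wavelength.
    """

    def __init__(self, order):
        if not order:
            raise ValueError("at least one processor must be scheduled")
        self._order = list(order)
        self._next = 0       # index of the next processor to be granted
        self._holder = None  # processor currently allowed to transmit

    def grant(self):
        """Send the grant (ID tag) to the next processor in the order."""
        if self._holder is not None:
            raise RuntimeError("wavelength has not been released yet")
        self._holder = self._order[self._next]
        # Assumed: the order wraps around for the next iteration.
        self._next = (self._next + 1) % len(self._order)
        return self._holder

    def release(self, processor):
        """Handle the release (ack) from the processor that held the grant."""
        if processor != self._holder:
            raise RuntimeError("release from a processor without the grant")
        self._holder = None


if __name__ == "__main__":
    ctrl = WavelengthController(["p_k1", "p_k2", "p_k3"])
    for _ in range(4):
        p = ctrl.grant()
        print(p, "transmits")
        ctrl.release(p)
```

Because a second grant is refused until the release arrives, at most one processor holds the wavelength at any time in this model.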

In Figure 5.3 we show a tunable source being used in conjunction with each
processor. Most commercially available tunable sources today use some type of
microelectromechanical (MEMS) switching element, and require times on the order of
milliseconds to tune between different wavelengths. We can consider this tuning time as an
addition to the interprocessor communication time, and would like switching times on
the order of several clock cycles, which is on the order of nanoseconds. One way to
accomplish this (although not strictly a ‘tunable’ source) is to simply use multiple discrete
sources, which can be selected electrically, and combine them together. This is shown
in the lower portion of Figure 5.4. Several groups are currently developing fast (Gb/s)
tunable wavelength converters for optical switching applications (e.g. see [CZ3, EB3IJ),
although the number of output wavelengths demonstrated has been small. For example,
Masanovic et al. recently reported a tuning range of 22 nm for an InP tunable laser and
wavelength converter [CZ53]. These devices may one day be suitable for the WDMOT
architecture.

Using the WDMOT architecture, any given interconnect topology that respects the
fanout constraints can be implemented. We show in Section |0| that the optimal
interconnect topology for a given application is often very irregular. We will explain in
Chapter 7 how to synthesize an optimal interconnect topology for a given application
with these fanout constraints.

5.2.1  Optical  Components 

The WDMOT architecture can be implemented with components developed for the
telecommunications market. The optical add/drop multiplexer (OADM) is a basic building
block of many optical systems where signals with arbitrary wavelengths must be
multiplexed to, or demultiplexed from, wavelength multiplexed signals. Figure 5.6 [S551J
depicts the basic configuration of an OADM using a dielectric thin film filter. OADMs
can be used at each processor to extract the correct wavelength for incoming data, to
separate the grant signals from the controllers, to multiplex outgoing data onto the data
fiber, and to multiplex the acknowledge signal onto the acknowledge fiber. At each
controller, OADMs can be used to add the grant signal onto the grant fiber, and to drop the
release signal from the acknowledge fiber.

Figure 5.6: Optical add/drop multiplexer utilizing a dielectric thin film filter (TFF).

5.2.2  Transaction  Ordering 

We define an ordered transaction graph $G_{wdmot}$ in a similar manner to Section 5.1.3.
Let $e_p$ be the set of communication edges in $G_{ipc}$ whose target nodes are scheduled on
processor $p$, and $\eta_p$ be the set of nodes that are sources for edges $e_p$. In order to ensure
that no two processors attempt to transmit on the same wavelength at the same time, we
must determine a transaction ordering $O_p$ for the nodes in $\eta_p$ for $p \in [1 \ldots N]$, where $N$
is the number of processors. Then

$$\Gamma(G_{ipc}, O_1, \ldots, O_N) = G_{wdmot} = (V_{wdmot}, E_{wdmot}),$$

where

$$V_{wdmot} = V_{ipc}, \quad \text{and}$$

$$E_{wdmot} = E_{ipc} \cup E_{O_1} \cup E_{O_2} \cup \ldots \cup E_{O_N}.$$

Figure 5.7 shows ordered transaction graphs for both the electrical shared bus and the
WDMOT architecture. For the electrical shared bus, all communications must be ordered and
four additional edges must be added to $G_{ipc}$. For the WDMOT architecture only two
additional edges must be added to $G_{ipc}$: nodes A and F, which both send to processor
2, must be ordered, and nodes G and J, which both send to processor 4, must be ordered.
Note that since the iteration cycle time (reciprocal of the throughput) of the system is
determined by the MCM of the ordered transaction graph, and since adding edges to a
graph cannot decrease the MCM, the throughput of the WDMOT architecture cannot be
less than that of the electrical shared bus architecture. Next we address the question of
determining the orderings $O_p$ for each wavelength.

[Figure 5.7 labels. Both panels show processors 1 through 5. Left, global ordering
(F-A-C-D-E-B-G-H-J-I): $E_O = \{(F,A), (A,C), (B,G), (H,J)\}$. Right, WDM ordering:
$E_{O_1} = E_{O_3} = E_{O_5} = \emptyset$, $E_{O_2} = \{(F,A)\}$, $E_{O_4} = \{(G,J)\}$.]

Figure 5.7: Comparison of ordered transaction graphs for shared bus (left) and WDM
architecture (right). Transaction order edges are shown dashed.

We first note that for the ordering to be correct, we should not introduce any
zero-delay cycles into the IPC graph. Such cycles would create deadlock in the system. In
Figure 5.7, for example, an ordering beginning with $F \to C \to D \to E \to A$,
placing E before A, would add the edge $(E, A)$ to $G_{ipc}$ and create the cycle
$A \to C \to D \to E \to A$. In other words, the transaction ordering should be a topological sort
of the directed acyclic graph resulting from removing the feedback edges from $G_{ipc}$.
Equivalently, since $G_{tpo}$ preserves all the dependencies in $G_{ipc}$, the transaction
ordering must be a topological sort of $G_{tpo}$, and our goal is then to find the best topological
sort. Unfortunately, the number of possible topological sorts can be exponential in the
size of the graph. For example, for a complete bipartite graph with $2n$ nodes, there are
$(n!)^2$ different topological sorts. We therefore see that we can reduce the search space
substantially by reducing the IPC graph and operating on the TPO graph.
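The $(n!)^2$ count for the complete bipartite case can be checked by brute force for small $n$. The Python sketch below is only a sanity check; the helper names are hypothetical, and it assumes all edges are directed from the $n$ nodes on one side to the $n$ nodes on the other.

```python
from itertools import permutations
from math import factorial


def count_topological_sorts(nodes, edges):
    """Count orderings of nodes in which every edge (u, v) has u before v."""
    count = 0
    for perm in permutations(nodes):
        pos = {n: i for i, n in enumerate(perm)}
        if all(pos[u] < pos[v] for u, v in edges):
            count += 1
    return count


def complete_bipartite(n):
    """n source nodes, each with an edge to each of n sink nodes."""
    left = [f"s{i}" for i in range(n)]
    right = [f"r{i}" for i in range(n)]
    return left + right, [(u, v) for u in left for v in right]


if __name__ == "__main__":
    for n in range(1, 4):
        nodes, edges = complete_bipartite(n)
        print(n, count_topological_sorts(nodes, edges), factorial(n) ** 2)
```

Every valid ordering places all sources (in any of $n!$ orders) before all sinks (in any of $n!$ orders), which is where the product comes from.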
Algorithm 5.1: CHOOSE COMMUNICATION ACTOR(G_ipc, readyList)

    input: IPC graph G_ipc
    input: list of actors readyList
    output: communication actor v

    if readyList.size() = 1
        then v ← readyList.head()
    for x ∈ readyList
        do for y ∈ readyList
               do if x ≠ y
                      then e ← G_ipc.addedge(x, y)
                           temp.add(e)
           criteria[x] ← MCM(G_ipc)
           for e ∈ temp
               do G_ipc.delete(e)
    v ← the actor x that attains min(criteria[x])

Figure 5.8: Function to choose the next communication actor in the transaction
order [E29.

We will see in Section 5.2.3 that a random topological sort produces relatively poor
results, and we must derive heuristics to guide the sort. We use a modification of the
transaction partial order heuristic [E2]. In this heuristic, edges are added between
communication nodes (actors) in $G_{ipc}$ that are contending for the bus, and the MCM of the
modified $G_{ipc}$ is measured. Actors whose corresponding MCMs are lower are
scheduled earlier in the transaction order, as discussed in Section 5.1.3. The algorithm for
choosing the next communication actor to schedule is given in Figure 5.8, while the
transaction partial order heuristic is given in Figure 5.9. For the WDMOT architecture,
we must determine an ordering for each wavelength controller. In order to do this, we
first determine a global ordering of the communication actors using the TPO heuristic,
Algorithm 5.2: TPO HEURISTIC(G_ipc, transactionOrder)

    input: IPC graph G_ipc
    output: linear list of communication actors transactionOrder

    compute G_tpo from G_ipc
    for v ∈ G_ipc
        do mark[v] ← FALSE
           if indegree(v) = 0
               then readyList.append(v)
    complete ← FALSE
    first ← TRUE
    while (complete ≠ TRUE)
        do v ← CHOOSE-COMMUNICATION-ACTOR(G_ipc, G_tpo, readyList)
           mark[v] ← TRUE
           transactionOrder.append(v)
           if (first)
               then first ← FALSE
               else G_ipc.addedge(w, v)
           w ← v
           for u such that (v, u) ∈ E
               do flag ← TRUE
                  for s such that (s, u) ∈ E
                      do if (mark(s) = FALSE)
                             then flag ← FALSE
                  if (flag)
                      then readyList.append(u)
           if (readyList.empty() = TRUE)
               then complete ← TRUE

Figure 5.9: TPO heuristic [I621J.
Algorithm 5.3: WDM TPO HEURISTIC(G_ipc, N)

    input: IPC graph G_ipc
    output: linear list of communication actors O_j for each wavelength j

    CALL HeuristicTPO(G_ipc, transactionOrder)
    for p ∈ [1 ... N]
        do for edges e ∈ G_ipc
               do if proc(target(e)) = p
                      then e_p ← e_p ∪ e
           for edges e ∈ e_p
               do η_p ← η_p ∪ source(e)
           O_p ← SORT(η_p using transactionOrder)

Figure 5.10: WDM ordered transactions algorithm.

and then use this ordering to sort the actors in each $\eta_p$. Pseudocode for the algorithm is
given in Figure 5.10.
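The per-wavelength projection step can be sketched in Python as follows. It assumes a global transaction order is already available (for example from the TPO heuristic) and only shows how the senders to each processor are sorted by it. The function name, the `(source, target)` edge pairs, and the receiver names `rA`, `rF`, `rG`, `rJ` are hypothetical; only the global order and the facts that A and F send to processor 2 and G and J send to processor 4 come from Figure 5.7.

```python
def wdm_orderings(comm_edges, proc_of, transaction_order, num_procs):
    """Return {p: O_p}, the ordered senders for each processor's wavelength.

    comm_edges: list of (source, target) communication edges
    proc_of: {actor: processor index in 1..num_procs}
    transaction_order: global linear order of the communication actors
    """
    rank = {actor: i for i, actor in enumerate(transaction_order)}
    orderings = {}
    for p in range(1, num_procs + 1):
        # eta_p: sources of communication edges whose target runs on p.
        eta_p = {src for (src, dst) in comm_edges if proc_of[dst] == p}
        # O_p: eta_p sorted according to the global transaction order.
        orderings[p] = sorted(eta_p, key=rank.__getitem__)
    return orderings


if __name__ == "__main__":
    global_order = ["F", "A", "C", "D", "E", "B", "G", "H", "J", "I"]
    # Hypothetical receivers standing in for the targets drawn in Figure 5.7.
    edges = [("A", "rA"), ("F", "rF"), ("G", "rG"), ("J", "rJ")]
    procs = {"rA": 2, "rF": 2, "rG": 4, "rJ": 4}
    print(wdm_orderings(edges, procs, global_order, 5))
```

With these inputs the sketch orders F before A on the wavelength of processor 2 and G before J on the wavelength of processor 4, which corresponds to the two added edges listed for the WDM ordering in Figure 5.7.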

5.2.3  Experiments 

We ran the WDM ordered transactions algorithms on a set of randomly generated graphs.
These graphs were generated using a modified version of Sih’s method 0250 which
produces graphs with a regular structure that resembles many DSP applications. We
modified Sih’s algorithm to ensure that the random connections do not introduce cycles, and
added a fanout parameter which controls the amount of parallelism in the graph.
Pseudocode for the random graph generation algorithm is shown in the Appendix. Examples
of random graphs generated using this algorithm are shown in Figure 5.11.

Figure 5.12 compares the WDM ordered transaction heuristic (using the fiber-based
WDMOT architecture) to the TPO heuristic (using a shared bus architecture) and a
transaction order based on a random topological sort (using a shared bus architecture) for a
set of randomly generated graphs. The y-axis in Figure 5.12 is the throughput ratio: the
ratio of the achieved throughput to the maximum possible throughput (the reciprocal of
the MCM of $G_{ipc}$), which is the throughput that would be obtained for a full crossbar
interconnect. In Figure 5.12 the ratio of average execution time for the computation
actors to the average IPC communication time (the computation ratio) was 2.5. We observe
that the throughput using the WDMOT architecture is usually very close to this maximum
in these graphs, and generally better than that for the shared bus. We also observe a
significant improvement for the TPO heuristic over the topological sort ordering. For
purposes of comparison we assume the same communication times for both optical and
electrical busses in this experiment. In practice the communication times for the optical
bus would be lower, which would increase the relative improvement of the WDMOT results.

Figure 5.11: Two examples, (a) and (b), of random graphs generated using a modification
of Sih’s algorithm |f)61j with 70 nodes and fanout 5.

[Figure 5.12 plot: "Comparison of ordering heuristics"; series: wdm (+), tpo (x), topsort (*).]

Figure 5.12: Comparison of TPO heuristic, WDMOT heuristic, and random topological
sort ordering for a set of random graphs.

In Figure 5.13 we plot the average, over 50 random graphs, of the throughput ratio
as the computation ratio is varied. We see that the WDMOT architecture produces
throughput very close to the theoretical maximum for this size graph (L = 8 in Sih’s
algorithm and 4 processors). Also, the performance of WDMOT relative to the shared
bus increases as the communication overhead increases (lower computation ratio on the
x-axis). The random topological sort performs significantly worse, and it shows no
improvement as the communication overhead decreases. This is because the random
topological sort is imposing an inefficient ordering of the computation nodes as well as
the communication nodes, and this effect is dominant.

In Figure 5.14 we plot the average, over 50 random graphs, of the throughput ratio
as the number of processors is varied. Again we see that the WDMOT architecture
performs close to the theoretical maximum. We also observe that its performance is less
sensitive to the number of processors.
Figure 5.13: Comparison of TPO heuristic, WDMOT heuristic, and random topological
sort ordering vs. computation ratio. Each point is the average of 50 random graphs. The
graph length was 8 and the application was scheduled on 4 processors.

Figure 5.14: Comparison of TPO heuristic, WDMOT heuristic, and random topological
sort ordering vs. number of processors. Each point is the average of 50 random graphs.
The graph length was 10 and the communication ratio was fixed at 1.
Chapter 6


Scheduling  for  Arbitrarily  Connected  Systems 


Scheduling for multiprocessor systems was introduced in Chapter [Z]. A vast range of
scheduling techniques for task graphs has been developed (e.g. see [110 It] for a review of
several representative approaches); however, these techniques typically assume a fixed
communication network, and do not systematically take connectivity constraints into
account. By connectivity constraints, we mean the inability of certain pairs of
processors to communicate with each other. Such constraints are desirable to impose in
optically connected multiprocessors because the power consumption of communication
is relatively independent of distance, and largely dependent instead on the number of
electrical-to-optical conversions that must be performed (this will be discussed further
in Section FTZ|).

Thus, it is advantageous to configure multiprocessor schedules in such a way that
multi-hop communication is avoided, or limited to some maximum number of hops per
communication operation, and the relative abundance of communication links is used
instead to achieve the required communication flexibility. However, such connectivity
constraints can cause list scheduling techniques and related methods to deadlock.
One contribution of this thesis is to develop a general framework for extending arbitrary
list scheduling approaches to avoid deadlock and operate efficiently in the presence of
connectivity constraints. We will apply this framework to jointly streamline the
communication network and task graph mapping for a given application in Chapter 0. This
framework can be used both for minimum-cost dedicated implementations, and for
reconfigurable networks, where the goal is to save power consumption by activating a
minimal subset of available laser-detector pairs.

6.1  Implications  of  Increased  Connectivity 

It has been shown that optical interconnects can provide a higher degree of
connectivity than electrical interconnects. For example, Li [fTTI] claims that using technology
available in the year 2000, interconnects of up to 1000 × 1000 I/O elements per square
centimeter can be achieved. In this thesis we explore the implications of various levels
of connectivity for multiprocessor systems, and interconnection networks that can make
use of the unique properties of optical interconnects.

One consequence of increasing levels of connectivity between processors is that it is
easier for mapping algorithms to find good solutions. We show here that the quality and
number of solutions found by a probabilistic search algorithm depend strongly on
the level of connectivity. Connectivity can be defined in the following way for a system
with $N$ processors $\{P_1, P_2, \ldots, P_N\}$: Let $(P_X \to P_Y) = \mathrm{TRUE}$ iff there exists a direct
communication link from $P_X$ to $P_Y$. Directional connectivity is defined as

$$\sum_{X} \sum_{Y \neq X} \mathrm{lvalue}(P_X \to P_Y).$$

Directionless connectivity is defined as

$$\sum_{X} \sum_{Y < X} \mathrm{lvalue}((P_X \to P_Y) \lor (P_Y \to P_X)),$$

where lvalue(TRUE) = 1 and lvalue(FALSE) = 0. An example is shown in Figure 6.1.

Figure 6.1: Example of directional and directionless connectivity.
6.1.1  Topology  Graph 

We define a topology graph $T(\Phi, L)$ such that the nodes of $T$ correspond to the
“processors” $\Phi$ in the architecture and the edges $L$ in $T$ correspond to direct physical
communication links between the processors. We define the set of all processors as $\Phi$ and label
the processors $\{p_1, p_2, \ldots, p_{|\Phi|}\}$. Then $T$ contains an edge $(p_i, p_j)$ iff the interconnection
network provides a direct (single-hop) communication link from $p_i$ to $p_j$. If $l$ is an edge
in $T$, we say that src($l$) is the source node of $l$; snk($l$) is the sink node of $l$; $l$ is directed
from src($l$) to snk($l$); $l$ is an output edge of src($l$); and $l$ is an input edge of snk($l$). We
define the degree of a processor as the number of incident (physical) communication
links. The degree of a node $v$ in $T$ is equal to the sum of the number of input and output
edges of $v$. For example, each processor in a fully-connected system with $|\Phi|$
processors has degree $2(|\Phi| - 1)$ (two links, one incoming and one outgoing, to each other
processor). Furthermore, a path in $T(\Phi, L)$ is a nonempty sequence $l_1, l_2, l_3, \ldots \in L$
such that snk($l_1$) = src($l_2$), snk($l_2$) = src($l_3$), ..., whose path length equals the
number of edges in the sequence. $T$ is said to be strongly connected if for each pair of distinct
nodes $(p_1, p_2)$ there is a path directed from $p_1$ to $p_2$ and there is a path directed from $p_2$
to $p_1$.
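The degree and strong-connectivity definitions can be checked with a short Python sketch. The names are hypothetical and the links are assumed to be stored as `(src, snk)` pairs. The strong-connectivity test uses the standard observation that a graph is strongly connected exactly when every node is reachable from one chosen root and the root is reachable from every node.

```python
def degree(node, links):
    """Number of input edges plus number of output edges of a node."""
    return sum(1 for (s, t) in links if s == node) + \
        sum(1 for (s, t) in links if t == node)


def reachable(start, links):
    """Set of nodes reachable from start along directed links."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for (s, t) in links:
            if s == u and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def is_strongly_connected(nodes, links):
    nodes = set(nodes)
    if not nodes:
        return True
    root = next(iter(nodes))
    reversed_links = {(t, s) for (s, t) in links}
    # Forward search: root reaches everyone. Reverse search: everyone reaches root.
    return reachable(root, links) >= nodes and \
        reachable(root, reversed_links) >= nodes


if __name__ == "__main__":
    procs = range(4)
    full = {(a, b) for a in procs for b in procs if a != b}
    print([degree(p, full) for p in procs])
    print(is_strongly_connected(procs, full))
    print(is_strongly_connected(range(3), {(0, 1), (1, 2)}))
```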


6.1.2  Effect  of  Connectivity  on  a  Simple  Mapping  Algorithm 

We begin by experimenting with a very simple mapping algorithm in order to observe
the effects of different levels of connectivity. In this experiment, a synthetic aperture
radar application with 60 tasks was mapped onto a 9-processor heterogeneous architecture
with the purpose of studying the resulting connectivity patterns. The goal of the
mapping algorithm was simply to determine which tasks should be assigned to which
processors, not the relative ordering of the tasks on the processors, or the effects of
interprocessor communication or iterative execution. With these numbers, the search space
consists of approximately 2 · 10^57 distinct mappings. A genetic algorithm was used to
explore this space. Performance was measured as a function of connectivity constraint
(i.e., an upper limit on allowable connectivity), with 10 trials for each connectivity
constraint point. Figure 6.2 shows the number of valid solutions found by the genetic
algorithm vs. connectivity. Figure 6.3 shows the best throughput obtained as a function of
the connectivity constraint. It can be seen that both the number of valid solutions and
the quality of these solutions increase with increased amounts of connectivity.
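The size of the search space follows from assigning each of the 60 tasks independently to one of 9 processors. A quick check of the order of magnitude, as a sketch (variable names are illustrative):

```python
import math

num_tasks, num_procs = 60, 9

# Each task can go to any processor, so there are num_procs ** num_tasks mappings.
mappings = num_procs ** num_tasks

exponent = math.floor(math.log10(mappings))  # power of ten
mantissa = mappings / 10 ** exponent         # leading digits

print(f"about {mantissa:.1f}e{exponent} distinct mappings")
```

This gives a value slightly below 2 · 10^57, consistent with the approximation quoted above.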

6.2  Connection  Topologies 

Electrically connected systems generally have a regular interconnection pattern, due
to the physical constraints imposed by two-dimensional circuit board layout. Some
examples include ring, mesh, bus, and hypercube interconnect topologies. Using these
topologies, communication between remote processors requires multiple hops, which
increases both latency and power, and increases contention throughout the network.

Figure 6.2: Impact of Connectivity on Search Efficiency. (Plot: valid solutions over 10 trials vs. connectivity constraint.)

Figure 6.3: Impact of Connectivity on Performance. (Plot: best throughput over 10 trials vs. connectivity constraint.)

In contrast, optically connected multiprocessors, particularly those utilizing free
space optics and three dimensions, are free to utilize arbitrarily irregular interconnection
networks. Once the signal is in the optical domain, there is very little attenuation, so the
energy required to transmit a unit of data is essentially independent of distance. The
required energy instead is a function of the number of electrical-to-optical conversions
that must be performed [E3Q, which in turn is determined by the number of hops.
Furthermore, due to the flexibility of the communication medium, it is generally possible
to avoid multi-hop communication operations by simply activating direct communication
channels between the source and destination processors. Together, these properties
make it desirable to limit the number of hops per communication operation when exploring
configurations (interconnection patterns and task graph mappings) for an optically
connected, embedded multiprocessor.

Irregular interconnection patterns arise naturally when scheduling task graphs under
the restriction of single-hop communication. A simple example of an irregular
interconnection network is shown in Figure 6.4. Given four processors and four bidirectional
links, there are two possible topologies, shown in Figure 6.4(a) and (b). Topology (a) has
a regular interconnection pattern, with each processor connected to two others.
Topology (b) is irregular, with one processor having degree three, one processor having degree
one, and the others having degree two. Topology (b) allows a single-hop schedule, since
all required communication can take place with only one hop. In topology (a), two hops
are required for communication from task A to task D and from D to E.

Figure 6.4: (a)-(b) Two possible topologies with four processors and four bidirectional
links. (c) Application graph.

Task graph scheduling algorithms generally produce schedules that require an
irregular interconnect topology for single-hop communication. For example, Figure 6.5
shows the application graph for an FFT application [ESJ. This application was
scheduled on eight processors using the DLS algorithm [ES3J, with no constraints placed on
interconnect placement. Figure 6.6 shows the topology required to operate the
resulting schedule using only single-hop communication operations. There are 14 directional
links out of a possible 56 for a fully connected system (the ratio of these two numbers
gives a measure of the average connectivity of each processor).
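To make the link count concrete, the following sketch derives the directed links a mapping needs for single-hop communication and the resulting connectivity ratio. The task graph and mapping are small hypothetical examples, not the FFT schedule discussed above, and the function names are illustrative.

```python
def required_links(edges, mapping):
    """Directed processor links needed so that every task-graph edge whose
    endpoints are mapped to different processors is a single hop."""
    return {(mapping[u], mapping[v])
            for u, v in edges
            if mapping[u] != mapping[v]}

def connectivity_ratio(links, num_procs):
    """Fraction of the num_procs * (num_procs - 1) possible directed links in use."""
    return len(links) / (num_procs * (num_procs - 1))

# Hypothetical four-task graph mapped onto three processors.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
mapping = {"A": 0, "B": 0, "C": 1, "D": 2}

links = required_links(edges, mapping)
ratio = connectivity_ratio(links, num_procs=3)
```

For the eight-processor FFT schedule in the text, the same ratio would be 14 / 56 = 0.25.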

If we define the degree of a processor as the number of incident (physical)
communication links, each processor in a fully-connected n-processor system, for example,
has degree 2(n − 1) (two links, one incoming and one outgoing, for each other processor).
In an arbitrary network, the relative variation in the degrees among different processors
gives a measure of the level of irregularity of the associated interconnection pattern. For
example, in the mapping of Figure 6.6, processors 0 and 6 have degree six, while
processors 3 and 5 have degree one. This trend of highly irregular connection requirements
occurs over a wide variety of task graph structures. To illustrate this, Figure 6.7 plots
the average of these measures over 100 real and synthetic benchmark application graphs
when scheduled on different numbers of processors. The synthetic benchmarks used
in these experiments were generated using the graph generation techniques of Sih 0550,
which are designed to construct task graphs that resemble the dataflow structures found
in DSP applications.

Figure 6.6: … for DLS schedule of FFT1 application.

Figure 6.7: Connectivity requirements of 100 benchmark applications.

As motivated earlier, when developing automated mapping tools for optically
connected systems, we have several design constraints. It is desirable to map the application
onto the architecture without requiring multi-hop communication, while satisfying
constraints on system throughput and latency. We also have limits on the maximum I/O
fanout and degree of a single processor. In order to conserve space and power, we
would also like to minimize the total number of communication links.


6.3  Connectivity  and  Scheduling  Flexibility 

Due to the desirability of single-hop communications in optically interconnected
multiprocessors, as motivated in Sections |0] and |Q|, it is important during co-design to
employ scheduling techniques that carefully take into account the connectivity of
candidate interconnection patterns. In systems that are not fully connected, the consequence
of single-hop communication is that each processor p can only send data to some subset
χ(p) of the set of all processors Φ, and only receive data from a subset θ(p) of Φ.

If these constraints are not considered, deadlock can easily occur during the
scheduling process. Consider an application graph G, two tasks v_1 and v_2 in G that have been
scheduled on processors p_1 and p_2, respectively, and a third task v_3 that receives data
from v_1 and v_2. Then if χ(p_1) ∩ χ(p_2) = ∅, the scheduler is deadlocked.
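The deadlock condition can be checked directly from the link set. The sketch below assumes that a processor can always "send" to itself (tasks on the same processor need no link), mirroring the zero-or-one-hop convention used later in the chapter; the topology and function names are hypothetical.

```python
def send_set(p, links):
    """Processors that p can reach in zero or one hop (assumption: p itself counts)."""
    return {p} | {z for (q, z) in links if q == p}

def deadlocked(p1, p2, links):
    """True if no processor can receive from both p1 and p2 in a single hop,
    so a task consuming data from both has nowhere to be scheduled."""
    return not (send_set(p1, links) & send_set(p2, links))

# Hypothetical directed topology on four processors.
links = {(2, 1), (2, 3), (3, 0)}
```

With this topology, producers on processors 0 and 1 leave no legal placement for a common consumer, whereas producers on processors 2 and 3 can both reach processor 3.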

We define a feasible set of processors Ψ[v] for a task v as the largest subset of Φ
on which v can be scheduled without deadlock. We would like to have an algorithm
to determine the feasible set of processors Ψ[v] for all v ∈ G. In general, a constraint
imposed on one task (scheduling it on a processor) may cause Ψ[v] to be updated for all
v ∈ G. This update consists of choosing a subset of the set Ψ[v] that existed before the
constraint; new members are never added.

We define the communication flexibility (or simply flexibility for short) of the system
at any point during the scheduling process as the sum of the sizes of the sets Ψ[v] for all
v ∈ G. The flexibility gives some measure of the degree of constraint imposed on all
tasks by a given scheduling move. Figure 6.8 depicts a simple example of an application
graph with six tasks being scheduled on four processors. A partial schedule is shown
in Figure 6.8(c). Scheduling task B in Figure 6.8(b) has an effect on tasks C, D, E,
and F. Figure 6.8(c) shows the constraint sets Ψ for each task after scheduling B on
processor 1. The flexibility at this point is equal to 11. If B had been scheduled on
processor 2, the flexibility would be 16. This example also demonstrates the potential
for deadlock. After task B is scheduled on processor 1, processor 0 becomes infeasible
for task C, since scheduling task C on processor 0 confines task E to processor 0. Task
F is confined to processor 1 since B is scheduled on 1. Task D sends data to both E
and F, and there is no processor which can communicate with both processors 0 and 1
in a single hop. Existing scheduling algorithms are not designed to detect this deadlock
condition. Avoiding these deadlock situations is not trivial, since scheduling one task in
the graph may constrain any other task in the graph.

Figure 6.8: Example requiring constraint information propagating both forward and
backward. (a) Processor connection. (c) Partial schedule: A on processor 2, B on
processor 1. Constraint sets: Ψ[A] = {2}, Ψ[B] = {1}, Ψ[F] = {1}, Ψ[D] = {1,2},
Ψ[E] = {1,2,3}, Ψ[C] = {1,2,3}.

Algorithm 6.1: Rf^1(S)

input: set of processors S ⊆ Φ
Φ is the set of all processors
output: set of processors R that can be reached from S in zero or one hop

R ← S
for all p ∈ S
  do for all z ∈ Φ
       do if C(p, z) = TRUE
            then R ← R ∪ {z}
return (R)

Figure 6.9: Function Rf^1(S).

Algorithm 6.2: Rb^1(S)

input: set of processors S ⊆ Φ
Φ is the set of all processors
output: set of processors R that can reach at least one element in S with zero or one hop

R ← S
for all p ∈ S
  do for all z ∈ Φ
       do if C(z, p) = TRUE
            then R ← R ∪ {z}
return (R)

Figure 6.10: Function Rb^1(S).

The algorithm described in Figures 6.9, 6.10, 6.11, 6.12, and 6.13 works by
propagating constraint information forward and backward through the application graph G.
The input n specifies the maximum number of hops allowed for two processors to
communicate with each other. In this chapter we will concentrate on single-hop
communication, where n = 1. First an edge-reversed copy G' of the application graph G is
created. When making a scheduling move (introducing a new constraint at a task s), the
constraint information is propagated forward using breadth first search from s through
G. When an endnode (task with no successors) is discovered during the forward phase
for the first time, it is added to a stack (named endNodes).

At the end of the forward phase, the backward phase begins. Each endnode is
removed from the stack and the constraint information is propagated backward by
performing breadth first search from the endnodes through the edge-reversed graph G'.
While propagating backward, newly discovered endnodes of G' are added to a second
stack. These endnodes are removed from the stack, and search continues in the forward
direction. The process continues until there are no newly found endnodes.

Algorithm 6.3: bfsForward(G, s, n, endNodes, bottomNodes, F)

input: application graph G
input: node s in G being considered
input: set of discovered endNodes
input: maximum hop communication allowed n
output: stack of newly discovered bottomNodes
input/output: array F of sets of feasible processors for each node in G
local variables: queue of nodes Q, array dist of distances for each node
local variables: set R of processor numbers, application graph nodes w, v

for all w ∈ G
  do dist[w] = −1
dist[s] = 0
Q ← {s}
while (Q ≠ ∅)
  do v = head(Q)
     for all w ∈ Adj[v]
       do if dist[w] < 0
            then enqueue(Q, w)
                 dist[w] = dist[v] + 1
                 F[w] = F[w] ∩ Rf^n(F[v])
                 if outdegree(w) = 0 and w ∉ endNodes
                   then push(w, bottomNodes)
     dequeue(Q)

Figure 6.11: Function bfsForward().

Algorithm 6.4: bfsBackward(G', H, s, n, endNodes, topNodes, F)

input: edge-reversed application graph G'
comment: for every node u ∈ G', v = G'[u] gives corresponding node in G
input: array of nodes H
comment: for any node v ∈ G, u = H[v] references corresponding node in G'
input: node s in G being considered
input: set of discovered endNodes
input: maximum hop communication n allowed
output: stack of newly discovered topNodes
input/output: array F of sets of feasible processors for each node in G
local variables: queue of nodes Q, array dist of distances for each node
local variables: set R of processor numbers, application graph nodes w, v, s
local variables: nodes in G': w', v', s'

for all w' ∈ G'
  do dist[w'] = −1
s' = H[s]
dist[s'] = 0
Q ← {s'}
while (Q ≠ ∅)
  do v' = head(Q)
     for all w' ∈ Adj[v']
       do if dist[w'] < 0
            then enqueue(Q, w')
                 w = H[w']
                 v = H[v']
                 dist[w'] = dist[v'] + 1
                 F[w] = F[w] ∩ Rb^n(F[v])
                 if outdegree(w') = 0 and w ∉ endNodes
                   then push(w, topNodes)
     dequeue(Q)

Figure 6.12: Function bfsBackward().

Algorithm 6.5: feasible(G, G', H, s, p, n, F, commit)

input: application graph G
input: edge-reversed application graph G'
comment: for every node u ∈ G', v = G'[u] gives corresponding node in G
input: array of nodes H
comment: for any node v ∈ G, u = H[v] references corresponding node in G'
input: node s in G being considered
input: processor p being considered
input: maximum hop communication n allowed
input: boolean value commit determines if changes to F are saved
input/output: array F of sets of feasible processors for each node in G
local variables: local copy of F Flocal, set of discovered endNodes
local variables: stack of nodes topNodes, application graph nodes v, v_t, v_b
local variables: sets of processor numbers f and r

if p ∉ F[s]
  then return −1
Flocal = F
Flocal[s] = {p}
push(topNodes, s)
while topNodes ≠ ∅
  do while topNodes ≠ ∅
       do pop(topNodes, v_t)
          insert(endNodes, v_t)
          bfsForward(G, v_t, n, endNodes, bottomNodes, Flocal)
     while bottomNodes ≠ ∅
       do pop(bottomNodes, v_b)
          insert(endNodes, v_b)
          bfsBackward(G', H, v_b, n, endNodes, topNodes, Flocal)
if commit
  then F = Flocal
flexibility = 0
for all v ∈ G
  do flexibility = flexibility + size(Flocal[v])
return (flexibility)

Figure 6.13: Function feasible() determines feasibility and flexibility (degree of
constraint) for scheduling task s on processor p.

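We define Rf^1(S) as the set of processors reachable from S in zero or one hop, and
Rb^1(S) as the set of processors that can reach at least one element of S in zero or one
hop. Then for multiple hops

    Rf^2(S) = Rf^1(Rf^1(S))

    Rb^2(S) = Rb^1(Rb^1(S))

    Rf^n(S) = Rf^1(Rf^{n-1}(S))

    Rb^n(S) = Rb^1(Rb^{n-1}(S))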
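The recursion above amounts to applying the one-hop function n times. A minimal sketch, assuming the connectivity predicate C(p, z) is stored as a dictionary from each processor to the set of processors it links to (a representation chosen here for illustration):

```python
def rf1(S, C):
    """Rf^1(S): processors reachable from S in zero or one hop."""
    return set(S) | {z for p in S for z in C.get(p, ())}

def rb1(S, C):
    """Rb^1(S): processors that can reach some element of S in zero or one hop."""
    S = set(S)
    return S | {z for z, outs in C.items() if outs & S}

def rfn(S, C, n):
    """Rf^n(S) = Rf^1(Rf^(n-1)(S)), computed iteratively."""
    R = set(S)
    for _ in range(n):
        R = rf1(R, C)
    return R

def rbn(S, C, n):
    """Rb^n(S) = Rb^1(Rb^(n-1)(S)), computed iteratively."""
    R = set(S)
    for _ in range(n):
        R = rb1(R, C)
    return R

# Hypothetical one-way chain 0 -> 1 -> 2 -> 3.
C = {0: {1}, 1: {2}, 2: {3}}
```

On the chain, each extra hop extends the forward set from processor 0 by one processor, and the backward set from processor 3 grows in the opposite direction.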


We define the functions bfsForward() and bfsBackward(), which use breadth first
search to propagate constraint information for a task s in a graph G in the forward and
backward direction.

The feasible() function described in Figure 6.13 returns an integer equal to the
sum of the sizes of the constraint sets for all nodes in the application graph G when
scheduling a task s on a processor p, given an input n corresponding to the maximum
number of communication hops allowed. If s is not feasible on p, the function returns
-1.
Scheduling step, at specified point in feasible()      Feasible sets

(1) schedule A on 2: end of feasible()                  A:[2] B:[1,2,3] C:[0,1,2,3] D:[1,2,3] E:[0,1,2,3] F:[0,1,2,3]
(2) schedule B on 1: after bfsForward() from B          A:[2] B:[1] C:[0,1,2,3] D:[1,2,3] E:[0,1,2,3] F:[1]
(2) schedule B on 1: after bfsBackward() from F         A:[2] B:[1] C:[0,1,2,3] D:[1,2] E:[0,1,2,3] F:[1]
(2) schedule B on 1: after bfsForward() from A          A:[2] B:[1] C:[0,1,2,3] D:[1,2] E:[1,2,3] F:[1]
(2) schedule B on 1: after bfsBackward() from E         A:[2] B:[1] C:[1,2,3] D:[1,2] E:[1,2,3] F:[1]
(3) schedule C on 3: end of feasible()                  A:[2] B:[1] C:[3] D:[2] E:[3] F:[1]
(3 alternate) schedule C on 1: end of feasible()        A:[2] B:[1] C:[1] D:[1,2] E:[1] F:[1]

Table 6.1: Feasible sets at various points during scheduling.
Table 6.1 lists the constraint sets for the tasks in the example of Figure 6.8 at various
stages of the feasible() function. We are scheduling A on 2, B on 1, and C on either 3
or 1. Here we can see that after scheduling task B on processor 1, processor 0 is not a
feasible choice for task C. In Table 6.1, the last row corresponds to the second choice
of scheduling C on processor 1.
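The propagation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis implementation: the task graph (A→B, A→D, B→F, D→E, D→F, C→E) and the directed links (2→1, 2→3, 3→0) are inferred here so that they are consistent with the feasible sets and flexibility values quoted in the text, since the figure itself is not reproduced; a predecessor map stands in for the edge-reversed graph; and a `seen` set ensures each endnode is pushed only the first time it is discovered.

```python
from collections import deque

def reach(S, links, n, forward):
    """Rf^n(S) if forward, else Rb^n(S): processors within n hops of S."""
    R = set(S)
    for _ in range(n):
        if forward:
            R |= {z for (p, z) in links if p in R}
        else:
            R |= {z for (z, p) in links if p in R}
    return R

def propagate(adj, start, links, n, F, seen, found, forward):
    """Breadth first search from start; narrows F[w] when w is first discovered."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
                F[w] = F[w] & reach(F[v], links, n, forward)
                if not adj[w] and w not in seen:  # newly found endnode
                    seen.add(w)
                    found.append(w)

def feasible(succ, pred, s, p, links, n, F, commit=False):
    """Flexibility after scheduling task s on processor p, or -1 if p is infeasible."""
    if p not in F[s]:
        return -1
    local = {v: set(ps) for v, ps in F.items()}
    local[s] = {p}
    seen, top, bottom = {s}, [s], []
    while top:
        while top:      # forward phase
            propagate(succ, top.pop(), links, n, local, seen, bottom, True)
        while bottom:   # backward phase
            propagate(pred, bottom.pop(), links, n, local, seen, top, False)
    if commit:
        F.update(local)
    return sum(len(ps) for ps in local.values())

# Assumed example data (see lead-in).
succ = {"A": ["B", "D"], "B": ["F"], "C": ["E"],
        "D": ["E", "F"], "E": [], "F": []}
pred = {v: [u for u in succ if v in succ[u]] for v in succ}
links = {(2, 1), (2, 3), (3, 0)}
F = {v: {0, 1, 2, 3} for v in succ}

step1 = feasible(succ, pred, "A", 2, links, 1, F, commit=True)
alt_b = feasible(succ, pred, "B", 2, links, 1, F)              # not committed
step2 = feasible(succ, pred, "B", 1, links, 1, F, commit=True)
c_on_0 = feasible(succ, pred, "C", 0, links, 1, F)
c_on_3 = feasible(succ, pred, "C", 3, links, 1, F)
c_on_1 = feasible(succ, pred, "C", 1, links, 1, F)
```

Under these assumptions, scheduling B on processor 1 yields a flexibility of 11 and scheduling B on processor 2 would yield 16, matching the values in the text, and processor 0 is rejected for task C.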

6.4  Complexity  of  the  Constraint  Algorithm 

The bfsForward function will be called once for the task s being considered, and
once for each task in G with no predecessors (an endnode of the edge-reversed graph).
The bfsBackward function will be called once for each task in G with no successors
(endnode in G). The complexity of breadth first search is O(v + e) for a graph with v
nodes and e edges. The bfsForward and bfsBackward functions require a set
intersection of two sets of size O(N), where N is the number of processors in the system.
This has complexity O(N log N). Functions Rb^n and Rf^n also have complexity
O(N log N). The overall complexity is therefore

    O(v(v + e) N log N)    (6.1)

This is a reasonable complexity figure in the embedded systems domain, where
compile/synthesis time tolerance is significantly higher compared to general-purpose
computation (e.g., see [d!]).

For interconnection graphs that are strongly connected, such as those in which all
links are bidirectional, Rf^n(S) = Φ (the set of all processors) and Rb^n(S) = Φ after
some number of hops h < N, and the breadth first searches do not need to proceed for
distances further than h where h is the maximum hop constraint given beforehand. In
this case the complexity is O(vhN log N).
6.5  Incorporating  Feasibility  and  Flexibility  into  Scheduling

The general class of list scheduling algorithms can easily be adapted to produce
single-hop (or n-hop) schedules by incorporating our constraint algorithm. This is
advantageous because it allows us to leverage a large library of useful scheduling techniques.

In list scheduling, a priority list L of tasks is constructed. The priority list L is a
linear ordering (v_1, v_2, ..., v_|V|) of the tasks in the application graph G = (V, E) such
that for any pair of distinct tasks v_i and v_j, v_i is to be given higher scheduling priority
than v_j if and only if i < j. Each task is mapped to an available processor as soon as
it becomes the highest-priority task according to L among all tasks that are ready. This
process is repeated until all tasks are scheduled.

The concepts of feasibility and flexibility, which were developed in Section 6.3,
can be incorporated into the general framework of list scheduling by restricting the set
of candidate processors to include only those that are feasible at the given scheduling
step, and by taking flexibility into account in designing the priority metric through which
tasks are ordered.

In the context of single-hop communication across arbitrary interconnection
patterns, the incorporation of feasibility considerations is required (to avoid scheduler
deadlock, as discussed in Section 6.3), while incorporation of flexibility is optional.
Furthermore, there are many possible ways to consider flexibility in the task
prioritization process. We show in Section 6.6 that even simple techniques for incorporating
flexibility information can lead to large performance improvements for a targeted class
of architectures.
6.6  Scheduling  Experiments  using  Flexibility


As mentioned earlier, our scheduling technique operates in conjunction with a given
list scheduling strategy. In our experiments, we employed the DLS algorithm [E23J as
the underlying list scheduling strategy, although, as explained in Section 6.5, any list
scheduling algorithm could have been used.

We  examined  a  set  of  DSP  application  benchmarks  and  scheduled  them  using  two 
different  scheduling  modes,  one  that  incorporates  only  feasibility  information  (to  avoid 
deadlock),  and  another  that  takes  both  feasibility  and  flexibility  into  account.  We  refer 
to  these  as  the  feasibility-only  and  feasibility-flexibility  modes,  respectively.  To  evaluate 
the  performance  across  a  range  of  connectivity  levels,  we  scheduled  the  applications 
onto  networks  with  varying  degrees  of  connectivity. 

In the feasibility-only mode, the processor P considered for a given task v at each
scheduling step was restricted to be in the feasible set Ψ[v] for v, as described in
Section 6.3, and no modification was made to the task prioritization metric of the underlying
list scheduling strategy (DLS).

In the feasibility-flexibility mode, the processor P considered at each scheduling
step was again restricted to be in the feasible set for v; however, whenever two processor
assignments for v resulted in equal priority levels L(v), where L represents the priority
metric of the original DLS algorithm, priority was given to the assignment that resulted
in a higher value of flexibility. In other words, priority was given to assignments that
offered greater flexibility for future scheduling decisions.
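The tie-breaking rule of the feasibility-flexibility mode can be sketched as a small selection function. It assumes that larger priority values are preferred and that an infeasible assignment is marked by a flexibility of -1, as returned by feasible(); the function name and the sample numbers are hypothetical.

```python
def pick_processor(candidates):
    """candidates maps processor -> (priority, flexibility).
    Infeasible processors (flexibility -1) are skipped; among the rest the
    highest priority wins and ties are broken by higher flexibility."""
    usable = {p: pf for p, pf in candidates.items() if pf[1] >= 0}
    if not usable:
        return None
    # Tuples compare element by element: priority first, then flexibility.
    return max(usable, key=lambda p: usable[p])

candidates = {
    0: (5, 11),   # same priority as processor 1, lower flexibility
    1: (5, 16),
    2: (4, 20),   # more flexible but lower priority
    3: (9, -1),   # infeasible, ignored despite its priority
}
choice = pick_processor(candidates)
```

Here processor 1 is chosen: it ties with processor 0 on priority and leaves more flexibility for later scheduling decisions.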

For each application, we chose a number N of processors, then generated a fully
connected network with N(N − 1) links. We scheduled the application using both
feasibility-only and feasibility-flexibility modes onto this network. Next we removed
one link from the network at random, and again scheduled the application using both
scheduling modes. We continued this process of removing links until no links remained,
resulting in all the tasks being scheduled on a single processor. We define the relative
improvement of the feasibility-flexibility mode over the feasibility-only mode by
comparing the average makespan over all link configurations.

Figure 6.14: Makespans for schedules constructed using DLS plus flexibility
computation with and without considering the processor flexibility metric. (Plot: FFT1, 7
processors, random link removal.)
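The link-removal procedure can be sketched as follows. The generator of link configurations follows the description above; the improvement formula (percent reduction of the mean makespan) is an assumption, since the text only says that average makespans are compared, and the makespan values in the usage lines are made up for illustration.

```python
import random

def link_removal_sequence(n, seed=0):
    """Start from a fully connected network of n processors and remove one
    randomly chosen directed link at a time until none remain."""
    rng = random.Random(seed)
    links = [(i, j) for i in range(n) for j in range(n) if i != j]
    rng.shuffle(links)
    configs = [set(links)]
    while links:
        links.pop()
        configs.append(set(links))
    return configs

def relative_improvement(base_makespans, improved_makespans):
    """Percent reduction of the mean makespan (assumed definition)."""
    mean_base = sum(base_makespans) / len(base_makespans)
    mean_improved = sum(improved_makespans) / len(improved_makespans)
    return 100 * (mean_base - mean_improved) / mean_base

configs = link_removal_sequence(7)
example = relative_improvement([10, 10], [7, 7])  # made-up makespans
```

For N = 7 this produces 43 configurations, from the full 42-link network down to the empty one.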

The result of this experiment for an FFT application is shown in Figure 6.14. If
we compare the average makespan for the schedules generated by feasibility-flexibility
mode (the top curve in Figure 6.14) with the average makespan of the schedules
generated without incorporating flexibility (the bottom curve in Figure 6.14) we see a 30%
relative improvement when the scheduling algorithm incorporates the flexibility metric.

Application    N    Improvement (%)

FFT1           7    30
Karp10         6    26
Irr            8    17
Qmf4           7    19
NN16-3-4       8    21
Sum1           6    8
Laplace        7    23
FFT2           7    15

Table 6.2: Relative makespan improvement obtained by using flexibility information in
the scheduling process.

Table 6.2 summarizes this relative improvement for several other DSP applications.
We performed experiments with the following application graphs: FFT1, Irr, FFT2,
Karp10, Qmf4, Laplace, Sum1, and NN16-3-4. The FFT graphs are different
implementations of the fast Fourier transform from Kahn [5SJ and contain 28 nodes each. Karp10
refers to the Karplus-Strong music synthesis algorithm with 10 voices (21 nodes), and
Qmf4 is a quadrature mirror filter bank with 14 nodes. Laplace is a Laplace transform,
Irr is an adaptation of a physics algorithm, and Sum1 is an upside down binary tree
representing the sum of products computation. A neural network classifier algorithm with
16 input nodes, 3 intermediate layers, and 4 output nodes, labeled NN16-3-4, was also
tested.




Application    N    Reduction in comm. energy (%)    Makespan increase for single hop

FFT1           7    16                               8
Karp10         6    24                               4
Irr            8    16                               (2)
Qmf4           7    32                               3
NN16-3-4       8    58                               2
Sum1           6    1                                4
Laplace        7    4                                (3)
FFT2           7    12                               2

Table 6.3: Reduction in communication of single hop schedule over three-hop schedule.

6.6.1  Power  Reduction  with  Single  Hop  Communication 

As mentioned earlier, it is advantageous to limit interprocessor communication to a low
number of hops because the energy required is proportional to the number of
electrical-to-optical conversions. In order to quantify this effect, we scheduled the benchmark
applications using our modified scheduling technique, which takes the number of hops
as an input parameter. We scheduled the benchmarks with hop constraints of one hop
and three hops, and compared the communication energy required. For our purposes,
we assumed all communication tasks transferred the same number of bits, so the energy
cost of all IPC actors was equal. With a three-hop limit, the scheduler is free to choose
any communication path that involves three or fewer hops and is thus less constrained
in its scheduling choices than with a one-hop limit. Table 6.3 shows the reduction in
the required communication energy for single-hop schedules over three-hop schedules
for the benchmark applications. We would expect that in general the schedules
constructed with the three-hop limit would have a lower makespan, since the scheduler is
less constrained: the set of moves available to the scheduler at any point using three
hops is a superset of the moves available when limited to one hop. For these
benchmarks, however, we found that any undesirable effect of the additional constraint for
single-hop schedules was very small, as can be seen in Table 6.3. In two of the
benchmarks (Irr and Laplace), the makespan was in fact better (lower) when we limited the
scheduler to single hops.
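The energy comparison can be illustrated with a simple model in which every transfer moves the same number of bits and each hop costs one unit of energy, as assumed in the text. The hop counts below are invented for illustration and do not correspond to any benchmark in Table 6.3; the function names are hypothetical.

```python
def comm_energy(hops_per_transfer, energy_per_hop=1.0):
    """Total communication energy: one conversion cost per hop of every transfer."""
    return energy_per_hop * sum(hops_per_transfer)

def reduction_percent(multi_hop_energy, single_hop_energy):
    """Percent energy saved by the single-hop schedule."""
    return 100 * (multi_hop_energy - single_hop_energy) / multi_hop_energy

# Invented example: five transfers, some routed over two or three hops ...
three_hop_schedule = [1, 2, 1, 3, 1]
# ... versus a schedule in which every transfer is a single hop.
single_hop_schedule = [1, 1, 1, 1, 1]

saving = reduction_percent(comm_energy(three_hop_schedule),
                           comm_energy(single_hop_schedule))
```

In practice the two schedules may also differ in the number of interprocessor transfers, which this sketch does not model.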

6.7  Summary  of  Flexibility  Work 

Optical interconnect technology is promising for global communication in embedded
multiprocessors, since the interconnection patterns can flexibly be streamlined and
reconfigured to match the target applications. However, due to the power consumption
characteristics of optical links, it is useful to restrict communication across them to
low-hop transfers. We have demonstrated an effective algorithm for determining the
set of feasible processors that will avoid schedule deadlock in a single-hop schedule,
and a useful metric, called communication flexibility, for the degree to which a given
scheduling decision constrains future decisions (in the context of the given
communication topology). We used this algorithm and the flexibility metric in conjunction with
the DLS algorithm to map several DSP applications across a wide range of interconnect
topologies. The results depicted in Figure 6.14 and Table 6.2 demonstrate both the
soundness of our feasibility computation techniques, and the utility of our flexibility metric in
guiding the scheduling process.




Chapter  7 


Synthesizing  an  Efficient  Interconnect  Network 


Embedded systems typically run a limited and fixed set of applications. We can use
this application-specific information to optimize the interconnection network. For our
purposes, an optimal network is defined in the context of a set of applications and
constraints. The constraints may include the latency, throughput, and power consumption
for the given applications, along with cost and area constraints of the overall system.

This problem is important for today's system-on-chip (SoC) designs utilizing
electronic interconnects as well as future designs that might utilize optical interconnects.
SoC design is moving toward a paradigm where reusable components called IP (for
intellectual property) from different vendors can be combined to rapidly create a design.
IP interface standards are being developed that define the services one IP component
(or IP block) is capable of delivering, and that enable IP blocks to work with on-chip
buses and other interconnection networks. The SoC designer's task is then to choose
the appropriate IP blocks, map the application tasks onto these blocks, and construct
a communication network and corresponding glue logic to connect these IP blocks. As
transistor density increases, more IP blocks can be placed on a single chip, and the
number of possible interconnections (links) between them increases. The longest wires on
the chip are usually due to these links. These wires contribute to delay and limit the
maximum achievable clock rate. Also, routing these interconnections is a significant
challenge for the EDA tools. Therefore, if we can minimize the number of links
required in the high level design stage, placement and routing can be improved in the back
end of the design process and performance will increase.

In  a  system  utilizing  optical  interconnects,  cost  and  area  constraints  dictate  the  total 
number  of  transmitters  and  receivers  in  the  system  (i.e.,  total  number  of  optical  links). 
Routing  constraints  from  local  partitions  to  their  associated  VCSEL  transmitters  and 
detectors  dictate  a  maximum  fanout  for  each  local  partition.  An  optimum  interconnect 
is  then  one  that  minimizes  the  number  of  links  while  enabling  the  application  to  meet 
the  power,  latency,  and  throughput  constraints. 

Realistic optical networks may incorporate relatively high, but not necessarily complete
(fully connected), levels of connectivity. Even in fully connected systems, such as
FAST-Net [SHI], it is still desirable from the viewpoint of power and heat dissipation to
have a minimal interconnect mapping, since for a given application, non-essential
transmitters can be turned off. In other optical processing implementations, the interconnect
network can be reconfigured between applications [53].

The freedom to optimize interconnection patterns opens up a vast design space, and
thus the design of an optimal interconnect structure for a given application or set of
applications is a significant challenge. In this chapter, we present both probabilistic
and deterministic interconnection synthesis algorithms. A key distinguishing feature of
our interconnect synthesis algorithms is that they work in conjunction with a scheduling
strategy; most existing interconnect synthesis algorithms assume a given schedule.

7.1 Greedy Interconnect Synthesis Algorithm


We developed a greedy heuristic algorithm, which we call the two-phase link adjustment
(TPLA) algorithm [0], to synthesize an interconnect and an associated multiprocessor
schedule for a given application. The TPLA algorithm starts with a fully connected
network and operates in down and up phases. The input to the algorithm is either a makespan
constraint for the application or a constraint on the total number of links.

Each step of the down phase in TPLA removes one link, while each step of the up
phase adds one link. One step of the down phase consists of assigning each existing link
a score based on the schedule makespan resulting from its removal, and removing the
link with the lowest score. A history of scores is kept for each link. For the first pass
through the down phase, ties between links are broken randomly. On subsequent passes,
the link history is used to break ties. The down phase continues until all the links are
removed.

Conversely,  one  step  of  the  up  phase  in  TPLA  consists  of  assigning  a  score  to  each 
missing  link  based  on  the  makespan  resulting  from  its  addition.  The  up  phase  continues 
until  the  network  is  fully  connected.  Repeated,  alternating  invocations  of  down  and  up 
phases  are  executed  for  some  time  limit  (determined  by  the  user),  and  the  best  result 
found  is  taken.  Given  a  makespan  constraint,  this  best  result  minimizes  the  number  of 
links.  Alternatively,  given  a  constraint  on  the  number  of  links,  the  best  result  minimizes 
the  makespan. 
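One down phase of the procedure described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `toy_makespan` is a hypothetical stand-in for the scheduler (the thesis uses a modified DLS algorithm), the score of a link is assumed to be the makespan obtained after removing it, and ties are broken randomly as in the first pass. The link history used on later passes is omitted.

```python
import random
from itertools import permutations


def down_phase(num_procs, makespan, rng):
    """One TPLA-style down phase, starting from a fully connected network.

    `makespan(links)` is a caller-supplied stand-in for the scheduler
    (hypothetical here) that returns the schedule length for a set of
    directed links (i, j). Returns one (link_count, makespan) point per step.
    """
    links = set(permutations(range(num_procs), 2))  # all N(N-1) directed links
    trace = [(len(links), makespan(links))]
    while links:
        # Score every remaining link by the makespan obtained without it.
        scores = {link: makespan(links - {link}) for link in links}
        best = min(scores.values())
        # First-pass rule from the text: break ties between links randomly.
        tied = sorted(link for link, score in scores.items() if score == best)
        links.remove(rng.choice(tied))
        trace.append((len(links), best))
    return trace


def toy_makespan(links, num_procs=4):
    """Toy cost model (an assumption): each missing link adds a fixed delay."""
    delay = {(i, j): (i + 1) * (j + 2) for i, j in permutations(range(num_procs), 2)}
    return 100 + sum(d for link, d in delay.items() if link not in links)


trace = down_phase(4, toy_makespan, random.Random(0))
for link_count, span in trace:
    print(link_count, span)
```

The recorded trace plays the role of the links-versus-makespan curve: with a makespan constraint one would pick the point with the fewest links that still meets it, and with a link constraint the point with the smallest makespan at that link count.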

7.1.1  Experiments  with  TPLA 

We evaluated the TPLA algorithm on a neural network classifier application called
RBFNN, consisting of 16 input nodes, 3 intermediate layers, and 4 output nodes. This
benchmark was chosen in part because it exhibits a large amount of inter-processor
communication. The scheduling algorithm used was the DLS algorithm [fiblj modified to
incorporate the flexibility metric, as detailed in Section ^3|. The bottom curve of
Figure 7.1 shows the best makespan achieved for each level of connectivity between 0 and
fully connected, after one down phase and one up phase. This gives a Pareto curve of
the trade-off between number of links and makespan for the application. For purposes
of comparison, the upper curve of Figure 7.1 shows the makespan achieved by starting
with a fully connected network and randomly removing one link at a time. The TPLA algorithm
shows a significant improvement (42% relative improvement) over random removal,
and is thus a promising starting point for developing more sophisticated link synthesis
algorithms. More broadly, it demonstrates the effectiveness of joint scheduling and
interconnect synthesis.

Figure 7.1: Link synthesis using the TPLA algorithm. (Plot title: neural network
16input 3hiddenLayer 4output - 8 processors.)

7.2  Link  Synthesis  using  Genetic  Algorithm 

We developed a genetic algorithm (GA) based interconnect synthesis algorithm. This
algorithm also employs the dynamic level scheduling (DLS) algorithm [0] modified for
arbitrary interconnection networks as the underlying list scheduling strategy, although
any list scheduling algorithm could have been used. The algorithm takes into account
constraints on the total number of links $l_{\max}$ and a maximum fanout for each processor
$f_{\max}$, as described earlier and motivated by area and cost constraints for the system.

7.2.1  Genetic  Algorithm  Overview 

Genetic algorithms will be described in more detail in Chapter 8; we give a brief
overview here in order to explain the link synthesis algorithm. When a genetic algorithm
is used to solve an optimization problem, it is necessary to be able to represent a
single solution to the problem with a single data structure. This representation is often
called a chromosome or an individual. The quality or fitness of a given solution is
evaluated using an objective function. Genetic algorithms are capable of both broad search
(exploration) and local search (exploitation) of a search space. They are often preferred
over gradient search methods because they are less likely to become trapped in local
minima and do not require a smooth search space.

The basic steps of a genetic algorithm are shown in Figure 7.2. The genetic algorithm
creates an initial population of candidate solutions using an initialization operator.
Often the initial population is distributed randomly over the search space. The genetic
algorithm then selects individuals from the population and performs crossover and mutation
operations on these individuals. Traditional crossover generates two children from
two parents in a population. This is depicted in Figure 7.3 for a chromosome whose
representative data structure is an array. A crossover point is chosen, shown by the dashed
vertical line in Figure 7.3, and the child chromosome is formed by the elements from
the first parent chromosome to the left of the crossover point and the elements from
the second parent to the right of the crossover point. The mutation operator specifies
a procedure for changing (mutating) an individual. The specifics of the mutation depend
on the data structure used to represent an individual. A typical mutation operator
for an individual represented by a binary string flips the bits in the string with a given
probability (the mutation probability).

One generation of a genetic algorithm consists of performing crossover and mutation on
individuals in the population. There are many possibilities for evolving the population.
A simple GA uses non-overlapping populations: each generation creates an entirely new
population of individuals. A steady state GA uses overlapping populations, in which a
fraction of the population is replaced in each generation. In an incremental GA each
generation consists of only one or two children.

Figure 7.2: Basic steps of a GA. (The flowchart begins with "Initialize population".)

Figure 7.3: Crossover operator applied to array chromosome.
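The array crossover and bit-flip mutation operators described in this overview can be sketched in a few lines. This is a generic illustration only; the function names and the explicit `point` and `p_mut` parameters are assumptions made for the example, and selection and population replacement are left out.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Return two children formed by swapping the parents' tails after `point`."""
    child_x = parent_a[:point] + parent_b[point:]
    child_y = parent_b[:point] + parent_a[point:]
    return child_x, child_y


def bit_flip_mutation(bits, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [b ^ 1 if rng.random() < p_mut else b for b in bits]


rng = random.Random(1)
a = [1, 1, 1, 1, 1, 1]
b = [0, 0, 0, 0, 0, 0]
x, y = one_point_crossover(a, b, point=2)
print(x, y)
print(bit_flip_mutation(x, 0.2, rng))
```

Each child takes the elements to the left of the crossover point from one parent and the elements to the right from the other, which is the behaviour Figure 7.3 depicts for an array chromosome.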

7.2.2  Problem  representation 

In our algorithm, the individuals are bit vectors corresponding to a given interconnect
topology. The fitness function for a chromosome in our interconnect synthesis algorithm
is described by

$$\text{fitness} = M\,(1 + P_f + P_l) \tag{7.1}$$

where $M$ is the makespan (latency) calculated by the modified DLS algorithm for the
interconnect topology of the chromosome, $P_f$ (Equation 7.6) is a penalty based on
violating the fanout constraint $f_{\max}$, and $P_l$ (Equation 7.7) is a penalty based on violating
the maximum link constraint $l_{\max}$.

We define a link vector as a bit vector with one entry for each possible interconnection
between two processors. For a system with $N$ processors, there are $N(N-1)$
entries in the link vector. The link vector for a four-processor system would be denoted
as

$$\mathbf{l} = (l_{01}\; l_{02}\; l_{03}\; l_{10}\; l_{12}\; l_{13}\; l_{20}\; l_{21}\; l_{23}\; l_{30}\; l_{31}\; l_{32}) \tag{7.2}$$

where $l_{ij}$ equals one if there is a connection from processor $i$ to processor $j$ and zero
otherwise. We define $l_{ij} = 0$ if $i = j$. We also write $\mathbf{l}$ as

$$\mathbf{l} = (\mathbf{l}_0\; \mathbf{l}_1\; \ldots\; \mathbf{l}_{N-1}) \tag{7.3}$$

where $\mathbf{l}_k$ describes the (outgoing) connections for processor $k$. We will refer to the
$\mathbf{l}_k$ as processor link vectors. We define the fanout of processor $i$ by

$$f_i = \sum_{j=0}^{N-1} l_{ij} \tag{7.4}$$

Then the number of links is given as

$$n_l = \sum_{i=0}^{N-1} f_i \tag{7.5}$$

while the fanout penalty is given by

$$P_f = \sum_{i=0}^{N-1} P_i \tag{7.6}$$

where $P_i = \max(0, f_i - f_{\max})$. The link penalty is given by

$$P_l = \max(0, n_l - l_{\max}). \tag{7.7}$$
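The fitness evaluation defined by Equations 7.1 and 7.4 through 7.7 can be written out directly. The sketch below assumes the link vector is stored as a flat Python list of $N(N-1)$ bits, grouped by source processor, and that the makespan has already been produced by a scheduler; the function names are illustrative and not taken from the thesis.

```python
def processor_link_vectors(link_vector, num_procs):
    """Split a flat link vector into N processor link vectors of N-1 bits each."""
    width = num_procs - 1
    return [link_vector[k * width:(k + 1) * width] for k in range(num_procs)]


def fitness(link_vector, num_procs, makespan, f_max, l_max):
    """Penalized makespan; `makespan` is assumed to come from the scheduler."""
    vectors = processor_link_vectors(link_vector, num_procs)
    fanouts = [sum(v) for v in vectors]              # Eq. 7.4: fanout per processor
    n_links = sum(fanouts)                           # Eq. 7.5: total number of links
    p_f = sum(max(0, f - f_max) for f in fanouts)    # Eq. 7.6: fanout penalty
    p_l = max(0, n_links - l_max)                    # Eq. 7.7: link penalty
    return makespan * (1 + p_f + p_l)                # Eq. 7.1


# Child X of Figure 7.4: fanouts (3, 1, 1, 1), six links in total.
child_x = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]
print(fitness(child_x, 4, makespan=100, f_max=2, l_max=5))
```

With $f_{\max} = 2$ and $l_{\max} = 5$, this topology violates each constraint by one, so the makespan is scaled by a factor of three. A topology that meets both constraints is scored by its makespan alone.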

7.2.3  Fanout  Constraints 

In a real system, cost and area constraints will place a limit on the processor fanout.
For example, in a free-space optical system such as FAST-Net [SEO, each link requires a
dedicated VCSEL/photoreceiver pair. In the WDM-based system proposed in Chapter
0, a separate wavelength is required to transmit to each processor, and each processor
requires a tunable source. In this case there is a physical limit on the number of
resolvable wavelengths, given by $n = B/F$, where $B$ is the fiber bandwidth and $F$ is the
channel spacing. Cost constraints may also limit the number of wavelengths allowed
for the tunable sources. For today’s WDM systems, $F = 50$ GHz while $B \approx 4000$ GHz,
corresponding to the wavelength range from 1530 to 1565 nm (C band) in a fiber. This
yields 80 channels. In order to achieve such narrow channel spacing, the temperature of
the laser transmitter must be carefully controlled. A lower cost variant of WDM, called
coarse wavelength division multiplexing (CWDM), is being deployed in metropolitan
networks. The latest standard proposed by the Full Spectrum CWDM Alliance is a
channel spacing $F = 20$ nm ($\approx 3600$ GHz) starting from 1271 nm up to 1611 nm. The
wider channel spacing ($\approx 72$ times greater) relaxes the tolerances on the lasers and
allows them to operate without temperature control.

In any case, it is important to have a link synthesis algorithm that can conform to
fanout constraints. Our GA is able to incorporate these constraints in a straightforward
manner by implementing the initialization, crossover, and mutation operators as
described below.

7.2.4  Crossover  and  Mutation  Operators 

We first note that if an individual topology is represented as a binary string as in Equation
7.2, then typical crossover operations such as array one-point crossover (Figure 7.3) or
two-point crossover will not preserve the fanout constraint. This is illustrated in Figure
7.4, where both parents obey a fanout constraint $f_{\max} = 2$, but processor 0 of child X
has fanout $f_{0X} = 3$. This is because the crossover point can fall at any bit position,
including inside a processor link vector. If we instead choose to represent the topology
by the vector representation of Equation 7.3, fanout constraints are preserved in the
crossover operation, since the link vectors for individual processors $\mathbf{l}_i$ are never altered.
The crossover operation only rearranges the relative position of these link vectors. This
is illustrated in Figure 7.5.

We also must ensure that the initial population obeys the link constraint. The
initialization operator generates random processor link vectors, each of which satisfies the fanout
constraint (Equation [73]). $N - 1$ of these vectors are then concatenated to form the link
vector.

The mutation operator simply chooses a random bit in the link vector and sets its
value to zero. This removes a link if one existed at this point. Since the mutation
operator only removes links, the fanout constraint is preserved.

Parent A: $\mathbf{l}_{0A} = (1,1,0)$, $f_{0A} = 2$; $\mathbf{l}_{1A} = (0,0,1)$, $f_{1A} = 1$; $\mathbf{l}_{2A} = (0,1,1)$, $f_{2A} = 2$; $\mathbf{l}_{3A} = (0,0,0)$, $f_{3A} = 0$
Parent B: $\mathbf{l}_{0B} = (1,0,1)$, $f_{0B} = 2$; $\mathbf{l}_{1B} = (0,1,0)$, $f_{1B} = 1$; $\mathbf{l}_{2B} = (0,0,1)$, $f_{2B} = 1$; $\mathbf{l}_{3B} = (1,0,0)$, $f_{3B} = 1$
Child X: $\mathbf{l}_{0X} = (1,1,1)$, $f_{0X} = 3$; $\mathbf{l}_{1X} = (0,1,0)$, $f_{1X} = 1$; $\mathbf{l}_{2X} = (0,0,1)$, $f_{2X} = 1$; $\mathbf{l}_{3X} = (1,0,0)$, $f_{3X} = 1$
Child Y: $\mathbf{l}_{0Y} = (1,0,0)$, $f_{0Y} = 1$; $\mathbf{l}_{1Y} = (0,0,1)$, $f_{1Y} = 1$; $\mathbf{l}_{2Y} = (0,1,1)$, $f_{2Y} = 2$; $\mathbf{l}_{3Y} = (0,0,0)$, $f_{3Y} = 0$

Figure 7.4: Crossover operation for link synthesis using the binary string representation
of Equation 7.2. The fanout constraint is not preserved for child X, where the fanout of
processor 0 is $f_{0X} = 3$.

Child X: $\mathbf{l}_{0X} = (1,1,0)$, $f_{0X} = 2$; $\mathbf{l}_{1X} = (0,0,1)$, $f_{1X} = 1$; $\mathbf{l}_{2X} = (0,0,1)$, $f_{2X} = 1$; $\mathbf{l}_{3X} = (1,0,0)$, $f_{3X} = 1$
Child Y: $\mathbf{l}_{0Y} = (1,0,1)$, $f_{0Y} = 2$; $\mathbf{l}_{1Y} = (0,1,0)$, $f_{1Y} = 1$; $\mathbf{l}_{2Y} = (0,1,1)$, $f_{2Y} = 2$; $\mathbf{l}_{3Y} = (0,0,0)$, $f_{3Y} = 0$

Figure 7.5: Crossover operation for link synthesis using the vector representation of
Equation 7.3. The fanout constraint $f_{\max} = 2$ is preserved in the children.
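The difference between the two crossover representations, and the effect of the link-removing mutation, can be checked with a short script that uses the parent values listed for Figure 7.4. This is a sketch under stated assumptions: the flat-list encoding, the function names, and the crossover positions (after bit 2, and after processor link vector 2) are illustrative choices and not taken from the thesis.

```python
import random


def fanouts(link_vector, num_procs):
    """Fanout of each processor for a flat link vector of N(N-1) bits."""
    width = num_procs - 1
    return [sum(link_vector[k * width:(k + 1) * width]) for k in range(num_procs)]


def bit_crossover(a, b, bit_point):
    """One-point crossover at an arbitrary bit position (binary string view)."""
    return a[:bit_point] + b[bit_point:], b[:bit_point] + a[bit_point:]


def vector_crossover(a, b, proc_point, num_procs):
    """One-point crossover restricted to processor link vector boundaries."""
    cut = proc_point * (num_procs - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def clearing_mutation(link_vector, rng):
    """Set one randomly chosen bit to zero, removing a link if one was present."""
    child = list(link_vector)
    child[rng.randrange(len(child))] = 0
    return child


N = 4
parent_a = [1, 1, 0,  0, 0, 1,  0, 1, 1,  0, 0, 0]
parent_b = [1, 0, 1,  0, 1, 0,  0, 0, 1,  1, 0, 0]

x_bits, y_bits = bit_crossover(parent_a, parent_b, bit_point=2)
x_vec, y_vec = vector_crossover(parent_a, parent_b, proc_point=2, num_procs=N)

print(fanouts(x_bits, N), fanouts(y_bits, N))  # binary string representation
print(fanouts(x_vec, N), fanouts(y_vec, N))    # vector representation
print(fanouts(clearing_mutation(x_vec, random.Random(0)), N))
```

Cutting inside the first processor link vector lets child X inherit two links from parent A and a third from parent B, which is how its fanout reaches 3. Cutting only at vector boundaries copies each processor link vector whole, so no fanout can exceed that of the parent it came from.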


7.2.5  Experiments 

We evaluated our GA-based interconnection synthesis algorithm on the RBFNN
application discussed above and compared it to the TPLA algorithm.
The genetic algorithm has several advantages over the TPLA algorithm. The first
advantage is that it is able to incorporate fanout constraints, which the TPLA algorithm does
not. Cost and area considerations often dictate fanout constraints. In a free-space optical
system, as already mentioned, fanout is dictated by the number of VCSELs and
photoreceivers that can be placed adjacent to a processor. In a WDM system, cost constraints
dictate the number of wavelengths used. The second advantage is that, in order to
synthesize a network for a given link constraint, the TPLA algorithm must evaluate many intermediate
topologies that do not meet the link constraint during its construction phases. This makes
it much less efficient, especially for systems with a large number of processors. Neither
of these algorithms takes into account isomorphically unique link topologies, which is
the subject of the following section. Figure 7.6 shows the best latency achieved for each
level of connectivity between zero connectivity and fully connected for both algorithms.
This gives a Pareto curve of the trade-off between number of links and latency for the
application. In order to properly compare the different algorithms, the GA run time was
limited to the run time required by TPLA. The results show that the algorithm based
on the GA performs 21% better (producing lower makespan schedules), when averaged
over the different link configurations, for this benchmark.

Figure 7.6: Comparison of TPLA and genetic algorithm for neural network application.
(Plot title: neural net (8 input - 2 hidden - 4 output) on 10 processors.)

Figure 7.7: Example of two isomorphic graphs, shown in panels (a) and (b).


7.3  Using  Graph  Isomorphism 


If we consider systems in which all the processors are identical (homogeneous processor
set), then we can pare the design space significantly if we only consider isomorphically
unique topology graphs. Two graphs $G = (V, E)$ and $G' = (V', E')$ are isomorphic if we
can relabel the vertices of $G$ to be vertices of $G'$, maintaining the corresponding edges
in $G$ and $G'$. For example, the graphs in Figures 7.7(a) and 7.7(b) are isomorphic with
the vertices relabelled as follows: $1 \to a$, $2 \to b$, $3 \to c$, and $4 \to d$.

Consider a topology graph $G$ with $E$ edges and $N$ nodes where each node corresponds
to a processor and each edge corresponds to a link between two processors.
The maximum number of edges in $G$ is $E_{\max} = N(N-1)$, corresponding to a fully
connected graph (full crossbar interconnect). If all links are bidirectional, the topology
graph is undirected and $E_{\max} = N(N-1)/2$. We can represent the graphs with either
an adjacency list or adjacency matrix and label each different representation. Then for a
graph with $E$ edges the number of different labellings is given by


$$n_g = \binom{E_{\max}}{E} = \frac{E_{\max}!}{E!\,(E_{\max} - E)!} = \frac{(N(N-1))!}{E!\,(N(N-1) - E)!} \tag{7.8}$$
which increases exponentially with $N$. The maximum value of $n_g$ occurs at $E =
E_{\max}/2$. However, the number of isomorphically unique graphs $n_{\text{unique}}$ is much less
than $n_g$. For very small $N$, we can enumerate the different possibilities to show this.
Figure 7.8 depicts the different isomorphic graphs for $N = 4$ processors and $E = 3$
bidirectional links. There are 20 different graph labellings, but we observe that most are
isomorphic; only 3 are isomorphically unique.
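The small case above can be reproduced by brute force. The sketch below enumerates every labelled undirected graph with a given number of edges and groups them by a canonical form obtained by trying all vertex relabellings. This exhaustive canonicalization is an assumption made for illustration and is only practical for very small $N$; it is not the method used by nauty.

```python
from itertools import combinations, permutations
from math import comb


def canonical_form(edges, num_nodes):
    """Smallest sorted edge list over all vertex relabellings (brute force)."""
    best = None
    for perm in permutations(range(num_nodes)):
        relabelled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabelled < best:
            best = relabelled
    return best


def count_graphs(num_nodes, num_edges):
    """Return (labelled count, isomorphically unique count) for undirected graphs."""
    possible_edges = list(combinations(range(num_nodes), 2))
    labelled = list(combinations(possible_edges, num_edges))
    unique = {canonical_form(graph, num_nodes) for graph in labelled}
    return len(labelled), len(unique)


print(count_graphs(4, 3))  # (20, 3)
print(comb(6, 3))          # 20, the binomial coefficient with E_max = 6 and E = 3
```

For $N = 4$ and $E = 3$ the three classes are the triangle with an isolated node, the path through all four nodes, and the star.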

For larger $N$, $n_g$ increases rapidly according to Equation 7.8. We enumerated the
possibilities and tested for isomorphism for $N = 5$ and $N = 6$ using Brendan McKay’s
nauty program [CEO, which is currently the fastest published graph isomorphism testing
program. The results are shown in Figure 7.9. For $N = 6$ and $E = 12$ we observe
a difference of three orders of magnitude between $n_g$ and $n_{\text{unique}}$. Also, this ratio
increases with $n_g$. We would like to exploit this property to pare the design space for
link synthesis.

Since $n_g$ is so large, it is impractical to compute and store the isomorphic graphs
in advance. Rather, we employ an on-line isomorphism test in order to speed up our
deterministic algorithm. This is illustrated in Figure 7.10. In this algorithm we store
the topology graphs, and schedule them only if they are not isomorphic with another
topology graph previously evaluated. We begin with a connected graph $G_1$ with $e =
N - 1$ edges, and define a set $S$ of evaluated graphs. Initially, $S = \{G_1\}$. We define a
parameter $j_{\max}$, which corresponds to the maximum number of graphs we will consider
at a given $e$, and a parameter $G_{\text{best},e}$, which corresponds to the best graph with $e$ edges.

We construct graph $G_2$ by adding an edge to $G_1$. At this step, there are $N(N-1) - e$
possible edges to add. If $G_2$ is not isomorphic with a graph in $S$, we set $S = S \cup \{G_2\}$
and schedule $G_2$ using a combination of the DLS scheduling algorithm [U7lj and the
flexibility algorithm described earlier. If the throughput using $G_2$ is higher than
the throughput using $G_{\text{best},e}$ then we replace $G_{\text{best},e}$ with $G_2$. Next we construct a graph $G_3$ by
removing an edge from $G_2$. If $G_3$ is not isomorphic with a graph in $S$ we schedule $G_3$,
set $S = S \cup \{G_3\}$, and update $G_{\text{best},e}$. If $G_3 = G_1$ then the algorithm is “stuck”: further
combinations of adding and removing edges will produce graphs already evaluated. At
this point we construct $G_1$ by adding an edge to $G_{\text{best},e}$ and repeat the above process by
constructing a new $G_2$ (with one more edge this iteration) by adding an edge to $G_1$. We
continue until either the throughput constraint is met and the algorithm is successful,
or until $e = e_{\max}$ and the algorithm fails to meet the throughput constraints with $e_{\max}$
edges.

Figure 7.8: Isomorphically unique graphs containing $E$ edges for $N = 4$ processors
($E_{\max} = 6$, all links bidirectional). Here we only consider undirected graphs representing
bidirectional links in order to make the figure clearer.

Figure 7.9: A comparison of the number of possible graph labellings $n_g$ given by
Equation 7.8 with the number of these graphs that are isomorphically unique, for
(a) $N = 5$ and (b) $N = 6$.
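The core of the on-line test, scheduling a topology only if no isomorphic topology has been evaluated before, can be sketched as a small filter around the scheduler. This is a simplified illustration and not the full add-and-remove search described above: the `schedule` callback is a hypothetical stand-in for the modified DLS scheduler, and the brute-force canonical form stands in for a real isomorphism tester such as nauty.

```python
from itertools import permutations


def canonical_form(edges, num_nodes):
    """Brute-force canonical label of a directed graph (small N only)."""
    return min(
        tuple(sorted((perm[u], perm[v]) for u, v in edges))
        for perm in permutations(range(num_nodes))
    )


class IsomorphismFilter:
    """Remember evaluated topologies so isomorphic ones are never rescheduled."""

    def __init__(self, num_nodes, schedule):
        self.num_nodes = num_nodes
        self.schedule = schedule  # hypothetical callback: edges -> throughput
        self.seen = set()         # plays the role of the set S of evaluated graphs
        self.calls = 0            # number of times the scheduler actually ran

    def evaluate(self, edges):
        """Return the throughput, or None if an isomorphic graph was already seen."""
        key = canonical_form(edges, self.num_nodes)
        if key in self.seen:
            return None
        self.seen.add(key)
        self.calls += 1
        return self.schedule(edges)


# Toy scheduler (an assumption): throughput grows with the number of links.
filt = IsomorphismFilter(3, schedule=lambda edges: 10 * len(edges))
print(filt.evaluate({(0, 1), (1, 2)}))  # directed path: scheduled
print(filt.evaluate({(2, 0), (0, 1)}))  # same path relabelled: skipped
print(filt.evaluate({(0, 1), (0, 2)}))  # out-star: not isomorphic, scheduled
print(filt.calls)
```

Storing canonical forms in a set makes the membership test a single lookup, so the filter pays off whenever computing the canonical form is cheaper than running the scheduler.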

The graph isomorphism test speeds up the deterministic link synthesis algorithm
only if isomorphism testing of a topology graph is faster than scheduling the
application on the graph. The complexity of graph isomorphism testing is still an open
problem: no polynomial-time algorithm is known, although
the problem has also not been shown to be NP-complete. It is thought that the
problem falls in the area between P and NP-complete, if such an area exists [FJ91]. However,
McKay’s nauty program [T78I] has proven to be very efficient in practice. Although
its worst case run time is exponential [EH], an empirical test of a large number of
randomly generated graphs produced run times of $1.2p^2$ ns on a 1 GHz Pentium III machine,
where $p$ is the number of nodes in the graph [17 Kl] ($p$ equals the number of processors in
our case). By comparison, the DLS scheduling algorithm has complexity $O(v^3 p)$, where
$v$ is the number of nodes in the task graph [971]. We modify the DLS scheduling
algorithm by adding a flexibility calculation at each scheduling step. The complexity of the
flexibility algorithm (Equation |Q1) is $O(v(v + e)\,p \log p)$, where $e$ is the number of edges
in the graph, so the overall complexity of scheduling an arbitrary graph using the modified
DLS scheduling algorithm is

$$O(v^4 (v + e)\, p^2 \log p). \tag{7.9}$$

The number of tasks in the application will be much greater than the number of
processors in practice, so $v \gg p$ and $e \gg p$. For randomly generated graphs, the nauty
program is therefore much faster than the modified DLS scheduling algorithm and we
achieve significant speedup by detecting and exploiting graph isomorphism.


Chapter 8


Design Space Exploration Using Simulated Heating

8.1  Introduction 

Application-specific, parameterized local search algorithms (PLSAs), in which
optimization accuracy can be traded off with run-time, arise naturally in many optimization
contexts, including most of the optimization problems discussed in this thesis. For many
problems in system design, the user wishes to first quickly evaluate many trade-offs in
the system, often in an interactive environment, and then to refine a few of the best
design points as thoroughly as possible. Often, an exact system simulation may take days
or weeks. In this context, it is quite useful to have optimization techniques where the
run-time can be controlled, and which will generate a solution of maximum quality in
the allotted time.

In this chapter we introduce a novel approach, which we call simulated heating,
for systematically integrating parameterized local search into evolutionary algorithms
(EAs). Using the framework of simulated heating, we investigate both static and
dynamic strategies for systematically managing the trade-off between PLSA accuracy and
optimization effort. Our goal is to achieve maximum solution quality within a fixed
optimization time budget.

We show that the simulated heating technique better utilizes the given optimization
time resources than standard hybrid methods that employ fixed parameters, and that the
technique is less sensitive to these parameter settings. In Chapter 0 we apply this
framework to the voltage scaling optimization problem discussed in Section fOj, a memory
cost minimization problem for embedded systems, and the well-known binary knapsack
problem. We compare our results to the standard hybrid methods, and show
quantitatively that careful management of this trade-off is necessary to achieve the full potential
of an EA/PLSA combination. We also explain how simulated heating could be used for
the interconnect synthesis problem and for the problem of finding optimal transaction
orders. Demonstrating the use of simulated heating on these last two problems is the
subject of future work.

For many optimization problems, efficient algorithms exist for refining arbitrary
points in the search space into better solutions. Such algorithms are called local search
algorithms because they define neighborhoods, typically based on initial “coarse”
solutions, in which to search for optima. Many of these algorithms are parameterizable in
nature. Based on the values of one or more algorithm parameters, such a parameterized
local search algorithm (PLSA) can trade off time or space complexity for optimization
accuracy.

PLSAs and evolutionary algorithms (EAs) have complementary advantages. EAs
are applicable to a wide range of problems, they are robust, and are designed to sample
a large search space without getting stuck at local optima. Problem-specific PLSAs are
often able to converge rapidly toward local minima. The term ‘local search’ generally
applies to methods that cannot escape these minima. For these reasons, PLSAs can be
incorporated into EAs in order to increase the efficiency of the optimization.

Several  techniques  for  incorporating  local  search  have  been  reported.  These  include 
Genetic  Local  Search  [EDA,  Genetic  Hybrids  [SO],  Random  Multi-Start  [ED],  GRASP  [HEQ, 
and  others.  These  techniques  are  often  demonstrated  on  well-known  problem  instances 
where  either  optimal  or  near-optimal  solutions  are  known.  The  optimization  goal  of 
these  techniques  is  then  to  obtain  a  solution  very  close  to  the  optimum  with  acceptable 
run-time.  In  this  regard,  the  incorporation  of  local  search  has  been  quite  successful. 
For example, Vasquez and Whitley [11061] demonstrated results within 0.75% of the best
known results for the Quadratic Assignment Problem using a hybrid approach, with all
run times under five hours. In most of these hybrid techniques the local search is run
with fixed parameter values (i.e., at the highest accuracy setting).

In this thesis, we consider a different optimization goal, which has not been
addressed so far. Here we are interested in generating a solution of maximum quality
within a specified optimization time, where the optimization run time is an important
constraint that must be obeyed. Such a fixed optimization time budget is a realistic
assumption in practical optimization scenarios. Many such scenarios arise in the
design of embedded systems. In a typical design process, the designer begins with only
a rough idea of the system architecture, and first needs to assess the effects of a large
number of design choices: different component parts, memory sizes, different software
implementations, etc. Since the time to market is very critical in the embedded system
business, the design process is on a strict schedule. In the first phases of the design
process, it is essential to get good estimates quickly so that these initial choices can be
made. Later, as the design process converges on a specific hardware/software solution,
it is important to get more accurate solutions. In these cases, the designer needs to have
the run time as an input to the optimization problem.

In order to accomplish this goal, we vary the parameters of the local search during
the optimization process in order to trade off accuracy for reduced complexity. Our
optimization approach is general enough to hold for any kind of global search algorithm;
however, in this thesis we test hybrid solutions that solely use an EA as the global search
algorithm. Existing hybrid techniques fix the local search at a single point, typically at
the highest accuracy. In the following discussion and experiments, we refer to this
method as a fixed parameter method. We will compare our results against this method.

One  of  the  central  issues  we  examine  is  how  the  computation  time  for  the  PLSA 
should  be  allocated  during  the  course  of  the  optimization.  More  time  allotted  to  each 
PLSA  invocation  implies  more  thorough  local  optimization  at  the  expense  of  a  smaller 
number  of  achievable  function  evaluations  (e.g.,  smaller  numbers  of  generations  ex¬ 
plored  with  evolutionary  methods),  and  vice-versa.  Arbitrary  management  of  this  trade¬ 
off  between  accuracy  and  run  time  of  the  PLSA  is  not  likely  to  generate  optimal  re¬ 
sults.  Furthermore,  the  proportion  of  time  that  should  be  allocated  to  each  call  of  the 
local  search  procedure  is  likely  to  be  highly  problem-specific  and  even  instance-specific. 
Thus,  dynamic  adaptive  approaches  may  be  more  desirable  than  static  approaches. 

In this thesis, we describe a technique called simulated heating [IQ, ODD], which
systematically incorporates parameterized local search into the framework of global
search. The idea is to increase the time allotted to each PLSA invocation during the
optimization process: low accuracy of the PLSA at the beginning and high accuracy
at the end.¹ This is in contrast to most existing hybrid techniques, which consider a
fixed local search function, usually operating at the highest accuracy. Within the context
of simulated heating optimization, we consider both static and dynamic strategies for
systematically increasing the PLSA accuracy and the corresponding optimization effort.
Our goals are to show that careful management of this trade-off is necessary to achieve
the full potential of an EA/PLSA combination, and to develop an efficient strategy for
achieving this trade-off management. We show that, in the context of a fixed optimization
time budget, the simulated heating technique performs better than using a fixed
local search.

¹In contrast to [11 1 111], the time budget here refers to the overall GSA/PLSA hybrid, not only the time
resources needed by the PLSA.

In  most  heuristic  optimization  techniques,  there  are  some  parameters  that  must  be  set 
by  the  user.  In  many  cases,  there  are  no  clear  guidelines  on  how  to  set  these  parameters. 
Moreover,  the  optimal  parameters  are  often  dependent  on  the  exact  problem  specifica¬ 
tion.  We  show  that  the  simulated  heating  technique,  while  still  requiring  parameters  to 
be  set  by  the  user,  is  less  sensitive  to  the  parameter  settings. 

First  we  will  outline  PLSAs  for  three  of  the  optimization  problems  covered  in  this 
thesis. 

8.1.1  PLSA  for  Voltage  Scaling 

Background 

Dynamic  voltage  scaling  0230  in  microprocessors  is  an  important  advancing  technology. 
It  allows  the  average  power  consumption  in  a  device  to  be  reduced  by  slowing  down 
(by  lowering  the  voltage)  some  tasks  in  the  application.  Here  we  will  assume  that  the 
application  is  specified  as  a  dataflow  graph.  We  are  given  a  schedule  (ordering  of  tasks 
on  the  processors)  and  a  constraint  on  the  throughput  of  the  system.  We  wish  to  find  a 
set  of  voltages  for  all  the  tasks  that  will  minimize  the  average  power  of  the  system  while 
satisfying  the  throughput  constraint.  The  only  way  to  compute  the  throughput  exactly  in 
these  systems  is  via  a  full  system  simulation.  However,  simulation  is  computationally 
intensive and we would like to minimize the number of simulations required during
synthesis. We have previously demonstrated that a data structure, called the period
graph,  can  be  used  as  an  efficient  estimator  for  the  system  throughput  [0]  and  thus 
reduce  the  number  of  simulations  required. 

Using  the  Period  Graph  for  Local  Search 

As  explained  in  [0]  and  in  Chapter  g,  we  can  estimate  the  throughput  of  the  system 
as voltage levels are changed by calculating the maximum cycle mean² (MCM) [16 hi]
of  the  period  graph.  In  order  to  construct  the  period  graph,  we  must  perform  one  full 
system  simulation  at  an  initial  point — after  the  period  graph  is  constructed  we  may  use 
the  MCM  estimate  without  re-simulating  the  system.  It  is  shown  in  [0]  that  the  MCM 
of  the  period  graph  is  an  accurate  estimate  for  the  throughput  if  the  task  execution  times 
are  varied  around  a  limited  region  (local  search),  and  that  the  quality  of  the  estimate 
increases  as  the  size  of  this  region  decreases.  A  variety  of  efficient,  low  polynomial¬ 
time  algorithms  have  been  developed  for  computing  the  maximum  cycle  mean  (e.g., 
see  D29]). 

We  can  use  the  size  of  the  local  search  neighborhood  as  the  parameter  p  in  a  pa¬ 
rameterized  local  search  algorithm  (PLSA).  We  call  this  parameter  the  re-simulation 
threshold  (r),  and  define  it  as  the  vector  distance  between  a  candidate  point  (vector  of 
voltages)  and  the  voltage  vector  V  from  which  the  period  graph  was  constructed.  To 
search  around  a  given  point  V  in  the  design  space,  we  must  simulate  once  and  build  the 
period  graph.  Then,  as  long  as  the  local  search  points  are  within  a  distance  r  from  V,  we 
can use the (efficient) period graph estimate. For points outside r, we must re-simulate
and rebuild the period graph. Consequently, there is a trade-off between speed and accuracy
for r: as r decreases, the period graph estimate is more accurate, but the local
search is slower since simulation is performed more often.

²Here the maximum cycle mean is the maximum, over all directed cycles of the period graph, of the
sum of the task execution times on a cycle divided by the sum of the edge delays (initial tokens) on a
cycle.
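To make the maximum cycle mean definition concrete, the following is a minimal brute-force sketch in Python. It enumerates every simple directed cycle, so it is only suitable for very small graphs and is not one of the efficient low-polynomial-time algorithms referred to above. The function name `max_cycle_mean` and the dictionary/tuple input format are assumptions made for illustration; they do not come from the thesis.

```python
def max_cycle_mean(exec_time, edges):
    """Brute-force maximum cycle mean of a small directed graph.

    exec_time: dict mapping node -> task execution time (illustrative format).
    edges: list of (src, dst, delay) tuples; delay = number of initial tokens.
    Returns the maximum over all simple cycles of
    (sum of execution times on the cycle) / (sum of edge delays on the cycle).
    """
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))

    best = float("-inf")

    def dfs(start, node, time_sum, delay_sum, visited):
        nonlocal best
        for nxt, d in adj.get(node, []):
            if nxt == start:
                total_delay = delay_sum + d
                if total_delay == 0:
                    raise ValueError("zero-delay cycle: cycle mean undefined")
                best = max(best, time_sum / total_delay)
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start node, so each cycle
                # is enumerated once, from its smallest node.
                visited.add(nxt)
                dfs(start, nxt, time_sum + exec_time[nxt], delay_sum + d, visited)
                visited.remove(nxt)

    for s in sorted(exec_time):
        dfs(s, s, exec_time[s], 0, {s})
    return best


# Two cycles: A-B-A with mean (3+2)/1 = 5 and B-C-B with mean (2+4)/2 = 3.
times = {"A": 3, "B": 2, "C": 4}
graph = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)]
print(max_cycle_mean(times, graph))
```

In a voltage scaling setting, the execution times would be rescaled according to the candidate voltage vector before the MCM is recomputed; that scaling step is omitted here.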

PLSA  Implementation 

To  solve  the  dynamic  voltage  scaling  optimization  problem  we  use  a  GSA/PLSA  hybrid 
where  an  evolutionary  algorithm  is  the  GSA  and  the  PLSA  is  either  a  hill  climbing  [KS4IJ 
or  Monte  Carlo  [ri71]  search  utilizing  the  period  graph.  Pseudo-code  for  both  local  search 
methods is shown in Figures 8.1 and 8.2. The benefit of using a local search algorithm
is  that  within  a  restricted  voltage  range  we  can  use  the  period  graph  estimator  for  the 
throughput,  which  is  much  faster  than  performing  a  simulation.  The  local  search  algo¬ 
rithms  are  explained  further  below. 

Voltage  Scaling  PLSA  1:  Hill  Climb  Local  Search 

[Algorithm 8.1: Hill Climb Local Search. The pseudo-code listing is not recoverable from the source.]

Figure 8.1: Pseudo-code for hill climb local search for voltage scaling application.

For the hill climbing algorithm, we defined a parameter δ, which is the voltage step, and
a re-simulation threshold r, which is the maximum amount that the voltage vector can
vary from the point at which the period graph was calculated. We ran the algorithm for
I iterations. So for this case, the PLSA L had three parameters: I, r, and δ. One iteration of
local search consisted of changing the node voltages, one at a time, by ±δ, and choosing
the direction in which the objective function was minimized. From this, the worst case
cost C(I, r, δ) for I iterations would correspond to evaluating the objective function 3I
times, and re-simulating $I / \lfloor r/\delta \rfloor$ times. For our experiments we fixed I and δ and
defined the local search parameter as p = 1/r. Then for smaller p (corresponding to
larger re-simulation threshold) the voltage vector can move a greater distance before a
new simulation is required. For a fixed number of iterations I in the local search, a

smaller p corresponds to a shorter running time C(p) for L(p). The accuracy A(p) is
lower, since the accuracy of the period graph estimate decreases as the voltage vector
moves farther away from the simulation point.

Algorithm 8.2: Monte Carlo Local Search(V_in, N, R, G, D, δ_resim, V_out, score)

input: voltage vector V_in of size N
input: number of random points generated R
input: D is maximum distance from V_in to random point
input: period graph G with N tasks
input: objective function F_obj derived from maximum cycle mean of G scaled by V
input: δ_resim is the resimulation threshold distance
output: voltage vector V_out
output: score

Generate a list L_rand of R random vectors uniformly distributed within
a distance no more than D from V_in
V_x ← V_in
score ← ∞
for (r = 1 to R)
    do q_r ← ||V_r − V_in||
while (L_rand not empty)
    do Pop head of list L_rand to get V
       Scale G by V
       q ← ||V_x − V||
       if (q < δ_resim)
           then f ← F_obj(V, G)
                if (f < score)
                    then score ← f
           else V_x ← V
                Resimulate around V and rebuild G
                for (r = 1 to size(L_rand))
                    do q_r ← ||V_r − V_x||
                Sort L_rand according to lowest q first

Figure 8.2: Pseudo-code for Monte Carlo local search for voltage scaling application.
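The hill climbing PLSA described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the listing of Figure 8.1. The callbacks `simulate` (builds an estimator model, standing in for a full simulation plus period graph construction) and `estimate` (evaluates the objective from that model) are hypothetical, as are the function name, the return values, and the choice to re-simulate at the current point whenever a candidate falls outside the threshold r.

```python
import math


def hill_climb(v_start, simulate, estimate, iterations, r, delta):
    """Coordinate-wise hill climbing with a re-simulation threshold.

    simulate(v) -> model built at point v (assumed expensive).
    estimate(model, v) -> objective estimate at v (assumed cheap).
    Returns (best point, best objective estimate, number of simulations).
    """
    if delta > r:
        raise ValueError("step delta must not exceed the threshold r")
    v = list(v_start)
    anchor = list(v)              # point at which the model was built
    model = simulate(anchor)
    n_sims = 1
    best = estimate(model, v)
    for _ in range(iterations):
        improved = False
        for k in range(len(v)):
            for step in (delta, -delta):
                cand = list(v)
                cand[k] += step
                if math.dist(cand, anchor) > r:
                    # Candidate left the trusted region: rebuild the model
                    # around the current point and re-score that point.
                    anchor = list(v)
                    model = simulate(anchor)
                    n_sims += 1
                    best = estimate(model, v)
                score = estimate(model, cand)
                if score < best:
                    v, best, improved = cand, score, True
        if not improved:
            break
    return v, best, n_sims


# Toy objective standing in for average power; the model is unused here.
toy_simulate = lambda v: None
toy_estimate = lambda model, v: sum((x - 1.0) ** 2 for x in v)
print(hill_climb([2.0, 2.0], toy_simulate, toy_estimate, 10, 0.5, 0.25))
```

With the toy objective, a smaller r (larger p = 1/r) triggers more calls to `simulate` for the same trajectory, which mirrors the cost side of the trade-off; the accuracy side is not modeled because the toy estimator is exact.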

Voltage  Scaling  PLSA  2:  Monte  Carlo  Local  Search 

In  the  Monte  Carlo  algorithm,  we  generated  N  random  voltage  vectors  within  a  dis¬ 
tance  D  from  the  input  vector.  For  all  points  within  a  re-simulation  threshold  r,  we  used 
the  period  graph  to  estimate  performance.  A  greedy  strategy  was  used  to  evaluate  the 
remaining  points.  Specifically,  we  selected  one  of  the  remaining  points  at  random,  per¬ 
formed  a  simulation  to  construct  a  new  period  graph,  and  used  the  resulting  estimator 
to  evaluate  all  points  within  a  distance  r  from  this  point.  If  there  were  points  remaining 
after  this,  we  chose  one  of  these  and  repeated  the  process.  For  the  experiments  we  fixed 
N and D and defined local search parameter p = 1/r. As for the hill climbing local
search,  smaller  values  of  p  correspond  to  shorter  run  times  and  less  accuracy  for  the 
Monte  Carlo  local  search. 
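The greedy evaluation strategy in the paragraph above can be sketched as follows. This follows the prose description (pick a remaining point at random, simulate there, evaluate everything within r) rather than the sorted-list variant in Figure 8.2. The `simulate` and `estimate` callbacks, the rejection-sampling generator, and all names are assumptions for illustration.

```python
import math
import random


def monte_carlo_search(v_in, simulate, estimate, n_points, d_max, r, rng):
    """Evaluate random points near v_in, re-simulating only when needed.

    Returns (best point, best objective estimate, number of simulations).
    """
    dim = len(v_in)
    points = []
    while len(points) < n_points:
        # Rejection sampling: uniform inside the ball of radius d_max.
        offset = [rng.uniform(-d_max, d_max) for _ in range(dim)]
        if math.hypot(*offset) <= d_max:
            points.append([a + b for a, b in zip(v_in, offset)])

    best_v, best_score, n_sims = None, math.inf, 0
    anchor = list(v_in)
    remaining = points
    while remaining:
        model = simulate(anchor)          # build estimator around the anchor
        n_sims += 1
        near = [pt for pt in remaining if math.dist(pt, anchor) <= r]
        remaining = [pt for pt in remaining if math.dist(pt, anchor) > r]
        for pt in near:
            score = estimate(model, pt)
            if score < best_score:
                best_v, best_score = pt, score
        if remaining:
            # The new anchor is itself within r of itself, so every pass
            # after the first removes at least one point.
            anchor = remaining[rng.randrange(len(remaining))]
    return best_v, best_score, n_sims


toy_simulate = lambda v: None
toy_estimate = lambda model, v: sum((x - 1.0) ** 2 for x in v)
result = monte_carlo_search([2.0, 2.0], toy_simulate, toy_estimate,
                            50, 1.0, 0.3, random.Random(0))
print(result[2], "simulations")
```

Rejection sampling is used only because it is short; it becomes inefficient as the dimension of the voltage vector grows.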

8.1.2  PLSA  for  Interconnect  Synthesis 

In  Section  [77T]  we  described  a  greedy  heuristic  algorithm,  called  the  TPLA  algorithm  [H], 
to  synthesize  an  interconnect  and  an  associated  multiprocessor  schedule  for  a  given 
application.  Here  we  will  describe  how  this  algorithm  can  be  parameterized  so  that  it 
can  be  used  as  a  PLSA  for  simulated  heating. 

The  TPLA  algorithm  starts  with  a  fully  connected  network,  and  operates  in  down 
and  up  phases.  Each  step  of  the  down  phase  in  TPLA  removes  one  link,  while  each  step 
of  the  up  phase  adds  one  link.  We  can  modify  this  basic  idea  to  create  a  parameterized 
local search for interconnect synthesis. The input to the local search is a processor
topology graph (Section 6.1.1) $T(\phi, L)$ with $\phi$ processors (nodes in the topology graph)
and l links (edges in the topology graph). We first remove p links, where p < l. There
are $d = \binom{l}{p}$ possible choices. Next we add back p links to achieve a new topology with
l links. This effectively creates a set of topologies in a local region around the input
topology. Figure 8.3 illustrates this concept.

[Figure 8.3 annotations: Remove (A,B) and (B,C); Add (B,A) and (C,D)]

Figure 8.3: PLSA for interconnect synthesis. The output topology graph is one possible
topology generated from the input topology graph with p = 2.

There are $\phi(\phi - 1) - (l - p)$ positions where a link may be added where one does
not already exist, so there are $u = \binom{\phi(\phi - 1) - (l - p)}{p}$ possible choices for adding back the
p links. The total number of combinations for first removing p links then adding back
p links is the product of u and d. Many of these topologies may be isomorphic to one
another, so the online graph isomorphism test (Section 7.3) can be used to avoid evaluating
isomorphic topologies. Each time an isomorphically unique topology is created, it
is  evaluated  using  a  scheduling  algorithm  utilizing  the  feasibility  and  flexibility  metrics 
(Chapter  §),  and  the  topology  with  the  best  schedule  is  chosen  as  the  output  of  the  local 
search. 

The number of topologies generated (the local search complexity) is a rapidly increasing
function of p, so the number of links p that are removed and re-added can itself be used
as the PLSA parameter. It is also possible to
place a limit on the number of combinations of p links removed ($d_{max}$) and the number
of combinations of p links added back ($u_{max}$). In this case the local search parameter
is instead defined as $p = d_{max} u_{max}$, which can be adjusted so that the local search
complexity does not increase so fast with the number of links exchanged.
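The size of this neighborhood can be checked with a short Python sketch that enumerates the remove-then-add combinations for a small directed topology. The representation of links as ordered pairs and the function names are assumptions for illustration. The sketch only filters exact duplicates and does not implement the isomorphism test or the schedule evaluation.

```python
from itertools import combinations
from math import comb


def neighbor_topologies(nodes, links, p):
    """Yield every topology obtained by removing p links, then adding p links."""
    links = set(links)
    all_positions = {(a, b) for a in nodes for b in nodes if a != b}
    for removed in combinations(sorted(links), p):
        kept = links - set(removed)
        free = sorted(all_positions - kept)   # positions with no link present
        for added in combinations(free, p):
            yield frozenset(kept | set(added))


def neighborhood_size(n_procs, n_links, p):
    """Closed-form count d * u from the text."""
    d = comb(n_links, p)
    u = comb(n_procs * (n_procs - 1) - (n_links - p), p)
    return d * u


nodes = ["A", "B", "C"]
ring = [("A", "B"), ("B", "C"), ("C", "A")]
generated = list(neighbor_topologies(nodes, ring, 1))
print(len(generated), neighborhood_size(3, 3, 1), len(set(generated)))
```

For this three-processor ring with p = 1 the enumeration yields 12 combinations, matching d · u = 3 · 4, of which 10 are distinct as link sets; an isomorphism test would reduce the number of topologies that need scheduling further.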


8.1.3  PLSA  for  Ordered  Transactions 


The ordered transactions strategy was covered in Chapter 5, where it was shown that the
problem of finding optimal transaction orders is NP-complete. In this section, we outline
how a PLSA could be constructed for this problem. Recall that the ordered transactions
graph (Section 5.1.3) $\Gamma(G_{ipc}, O)$ is created from an IPC graph $G_{ipc}$ and a transaction
order O, and that the MCM of $\Gamma$ gives the throughput of the system. A PLSA for the
ordered transactions problem takes an input ordering $O_{in}$ and evaluates permutations
around $O_{in}$ to produce a better ordering $O_{out}$. The permutation method we propose is
a pair swap: we swap the positions of a pair of nodes in the transaction ordering. If
the swapping does not create any zero-delay cycles, we can calculate the MCM of the
new ordered transaction graph. If the pair swap has produced a lower MCM, we keep
this ordering and attempt to swap another pair of nodes. This continues for a number p
of iterations, where p is the local search parameter. Pseudo-code for this local search is
given in Figure 8.4.

Algorithm 8.3: Ordered Transactions Local Search(O_in, G_ipc, O_out)

input: transaction ordering O_in of length n
input: IPC graph G_ipc
output: new transaction ordering O_out

i ← n
O_out ← O_in
bestScore ← MCM(Γ(G_ipc, O_in))
count ← 0
while (i > 0) ∧ (count < pn)
    do j ← −1
       i ← i − 1
       while (j < i) ∧ (count < pn)
           do j ← j + 1
              if O_out[i] ≠ O_out[j]
                  then temp ← O_out[i]
                       O_out[i] ← O_out[j]
                       O_out[j] ← temp
                       if zero-delay cycles in Γ(G_ipc, O_out)
                           then score ← ∞
                           else score ← MCM(Γ(G_ipc, O_out))
                       if score < bestScore
                           then bestScore ← score
                           else temp ← O_out[i]
                                O_out[i] ← O_out[j]
                                O_out[j] ← temp

Figure 8.4: Pseudo-code for PLSA for ordered transactions strategy.
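A runnable Python sketch of the pair-swap idea is given below. It is not a transcription of Figure 8.4: the scoring callback `score_fn` is a hypothetical stand-in for building the ordered transactions graph and computing its MCM (returning infinity when a zero-delay cycle appears), and the explicit evaluation counter `max_evals` is an assumption used here to bound the search effort.

```python
import math


def pair_swap_search(order, score_fn, max_evals):
    """Greedy pair-swap local search over a transaction ordering.

    score_fn(order) -> cost to minimize; math.inf marks an infeasible order.
    max_evals bounds the number of candidate orderings that are scored.
    Returns (best ordering found, its score).
    """
    best = list(order)                 # work on a copy of the input ordering
    best_score = score_fn(best)
    evals = 0
    for i in range(len(best) - 1, 0, -1):
        for j in range(i):
            if evals >= max_evals:
                return best, best_score
            best[i], best[j] = best[j], best[i]       # try the swap
            score = score_fn(best)
            evals += 1
            if score < best_score:
                best_score = score                    # keep the improvement
            else:
                best[i], best[j] = best[j], best[i]   # undo the swap
    return best, best_score


# Toy cost: total displacement of each item from its sorted position.
toy_score = lambda o: sum(abs(x - k) for k, x in enumerate(o))
print(pair_swap_search([2, 0, 1], toy_score, 10))   # full search
print(pair_swap_search([2, 0, 1], toy_score, 1))    # truncated search
```

Raising `max_evals` plays the role of increasing the PLSA parameter: more swaps are examined per invocation at a proportionally higher cost.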

8.2  Hybrid  Global/Local  Search  Related  Work 

In  the  field  of  evolutionary  computation,  hybridization  seems  to  be  common  for  real- 
world  applications  |)PJ  and  many  evolutionary  algorithm/local  search  method  combi¬ 
nations  can  be  found  in  the  literature,  e.g.,  [_T(X  53,  ED,  92,  Mill].  Local  search  tech¬ 
niques  can  often  be  incorporated  naturally  into  evolutionary  algorithms  ( EAs )  in  order 
to increase the effectiveness of optimization. This has the potential to exploit the
complementary advantages of EAs (generality, robustness, global search efficiency), and
problem-specific PLSAs (exploiting application-specific problem structure, rapid convergence
toward local minima). Below we list some hybrid methods in the literature, and
suggest  how  they  could  potentially  be  adapted  to  use  our  simulated  heating  technique. 

One  problem  to  which  hybrid  approaches  have  been  successfully  applied  is  the 
quadratic  assignment  problem  (QAP),  which  is  an  important  combinatorial  problem. 
Several groups have used hybrid genetic algorithms that are effective in solving the QAP.
The QAP concerns n facilities, which must be assigned to n locations at minimum cost.
The problem is to minimize the cost

$$C(\pi) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \, b_{\pi(i)\pi(j)}, \qquad \pi \in \Pi(n)$$

where $\Pi(n)$ is the set of all permutations of {1, 2, ..., n}, $a_{ij}$ are elements of a distance
matrix, and $b_{ij}$ are elements of a flow matrix representing the flow of materials from
facility i to facility j.

Merz  and  Freisleben  [IHIJ  presented  a  Genetic  Local  Search  (GLS)  technique,  which 
applies  a  variant  of  the  2-opt  heuristic  as  a  local  search  technique.  For  the  QAP,  the  2-opt 
neighborhood  is  defined  as  the  set  of  all  solutions  that  can  be  reached  from  the  current 
solution by swapping two elements of the permutation $\pi$. The size of this neighborhood
increases quadratically with n. The 2-opt local search employed by Merz takes the first
swap that reduces the total cost. This is done to increase efficiency.

Fleurent and Ferland [40] combined a genetic algorithm with a local Tabu Search
(TS)  method.  In  contrast  to  the  simpler  local  search  of  Merz,  the  idea  of  the  TS  is  to 
consider  all  possible  moves  from  the  current  solution  to  a  neighboring  solution.  Their 
method  is  called  Genetic  Hybrids.  They  improved  the  best  solutions  known  at  the  time 
for  most  large  scale  QAP  problems. 

By comparison, simulated heating for QAP might be formulated as a combination
of the above two methods. One could consider the best of m moves found that reduce
the total cost, where m is the PLSA parameter.
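One possible reading of this "best of m improving moves" idea is sketched below in Python, using the QAP cost defined earlier with 0-based indices. The function names and the pass structure (scan swaps in a fixed order, stop the scan once m improving swaps are found, apply the best of them) are assumptions for illustration and are not taken from the cited work. With m = 1 the procedure reduces to a first-improvement swap search.

```python
def qap_cost(a, b, perm):
    """QAP cost: sum over i, j of a[i][j] * b[perm[i]][perm[j]]."""
    n = len(perm)
    return sum(a[i][j] * b[perm[i]][perm[j]] for i in range(n) for j in range(n))


def best_of_m_swap_search(a, b, perm, m):
    """Swap-based local search; each pass applies the best of the first m
    improving swaps encountered. Stops at a local minimum."""
    perm = list(perm)
    n = len(perm)
    cost = qap_cost(a, b, perm)
    while True:
        improving = []
        for i in range(n - 1):
            for j in range(i + 1, n):
                perm[i], perm[j] = perm[j], perm[i]
                c = qap_cost(a, b, perm)
                perm[i], perm[j] = perm[j], perm[i]   # restore
                if c < cost:
                    improving.append((c, i, j))
                if len(improving) >= m:
                    break
            if len(improving) >= m:
                break
        if not improving:
            return perm, cost
        cost, i, j = min(improving)       # best of the collected moves
        perm[i], perm[j] = perm[j], perm[i]


dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
flow = [[0, 5, 0], [5, 0, 1], [0, 1, 0]]
print(best_of_m_swap_search(dist, flow, [0, 2, 1], 1))
print(best_of_m_swap_search(dist, flow, [0, 2, 1], 3))
```

The full cost is recomputed for every trial swap to keep the sketch short; practical implementations typically use an incremental cost update instead.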

Vasquez  and  Whitley  [11  Obi]  also  presented  a  technique,  which  combines  a  genetic 
algorithm  with  TS,  where  the  genetic  algorithm  is  used  to  explore  in  parallel  several 
regions  of  the  search  space  and  uses  a  fixed  Tabu  local  search  to  improve  the  search 
around  some  selected  regions.  They  demonstrated  near  optimal  performance,  within 
0.75%  of  the  best  known  solutions.  They  did  not  investigate  their  technique  in  the 
context  of  a  fixed  optimization  time  budget. 

Random  multi-start  local  search  has  been  one  of  the  most  commonly  used  tech¬ 
niques  for  combinatorial  optimization  problems  [ED,  EH!].  In  this  technique,  a  number 
of  solutions  are  generated  randomly  at  each  step,  local  search  is  repeated  on  these  solu¬ 
tions,  and  the  best  solution  found  during  the  entire  optimization  is  output.  Several  im¬ 
provements  over  random  multi-start  have  been  described.  Greedy  randomized  adaptive 
search  procedures  (GRASP)  combine  the  power  of  greedy  heuristics,  randomization, 
and  conventional  local  search  procedures  [BSD  -  Each  GRASP  iteration  consists  of  two 
phases — a  construction  phase  and  a  local  search  phase.  During  the  construction  phase, 
each  element  is  selected  at  random  from  a  list  of  candidates  determined  by  an  adaptive 
greedy algorithm. The size of this list is restricted by parameters $\alpha$ and $\beta$, where $\alpha$ is
a value restriction and $\beta$ is a cardinality restriction. Feo et al. demonstrate the GRASP
technique on a single machine scheduling problem [El], a set covering problem, and a
maximum independent set problem [BE!]. They run the GRASP for several fixed values
of $\alpha$ and $\beta$, and show that the optimal parameter values are problem dependent. In simulated
heating, $\alpha$ and $\beta$ would be candidates for parameter adaptation. In the second
phase of GRASP, a local search is applied to the constructed solution to find a local
optimum. For the set covering problem, Feo et al. define a k,p exchange local search
where all k-tuples in a cover are exchanged with a p-tuple. Here, k was fixed during
optimization.  In  a  simulated  heating  optimization,  k  might  be  used  as  the  PLSA  param¬ 
eter,  with  smaller  tuples  being  exchanged  at  the  beginning  of  the  optimization  and  larger 
tuples  examined  at  the  end.  A  similar  k-exchange  local  search  procedure  was  used  for 
the  maximum  independent  set  problem. 

Kazarlis  et  al.  [EDO  demonstrate  a  microgenetic  algorithm  (MGA)  as  a  generalized 
hill-climbing  operator.  The  MGA  is  a  GA  with  a  small  population  and  a  short  evolution. 
The  main  GA  performs  global  search  while  the  MGA  explores  a  neighborhood  of  the 
current  solution  provided  by  the  main  GA,  looking  for  better  solutions.  The  main  advan¬ 
tage  of  the  MGA  is  its  ability  to  identify  and  follow  narrow  ridges  of  arbitrary  direction 
leading  to  the  global  optimum.  Applied  to  simulated  heating,  MGA  could  be  used  as  the 
local  search  function  with  the  population  size  and  number  of  generations  used  as  PLSA 
parameters. 

He  and  Xu  [PPH]  describe  three  hybrid  genetic  algorithms  for  solving  linear  and  par¬ 
tial  differential  equations.  The  hybrid  algorithms  integrate  the  classical  successive  over 
relaxation  (SOR)  with  evolutionary  computation  techniques.  The  recombination  opera¬ 
tor  in  the  hybrid  algorithm  mixes  two  parents,  while  the  mutation  operator  is  equivalent 
to one iteration of the SOR method. A relaxation parameter $\omega$ for the SOR is adapted
during the optimization. He and Xu observe that it is very difficult to estimate the optimal
$\omega$, and that the SOR is very sensitive to this parameter. Their hybrid algorithm does not
require  the  user  to  estimate  the  parameter;  rather,  it  is  evolved  during  the  optimization. 
Different  relaxation  factors  are  used  for  different  individuals  in  a  given  population.  The 
relaxation  factors  are  adapted  based  on  the  fitness  of  the  individuals.  By  contrast,  in 
simulated  heating  all  members  of  a  given  population  are  assigned  the  same  local  search 
parameter at a given point in the optimization.

When employing PLSAs in the context of many optimization scenarios, however, a
critical issue is how to use computational resources most efficiently under a given optimization
time budget (e.g., a minute, an hour, a day, etc.). Goldberg and Voessner [MQ]
study this issue in the context of a fixed local search time. They idealize the hybrid as
consisting  of  steps  performed  by  a  global  solver  G,  followed  by  steps  by  a  local  solver 
L,  and  a  search  space  as  consisting  of  basins  of  attraction  that  lead  to  acceptable  targets. 
Using  this,  they  are  able  to  decompose  the  problem  of  hybrid  search,  and  to  characterize 
the  optimum  local  search  time  that  maximizes  the  probability  of  achieving  a  solution  of 
a  specified  accuracy. 

Here,  we  consider  both  fixed  and  variable  local  search  time.  The  issue  of  how  to 
best  manage  computational  resources  under  a  fixed  time  budget  translates  into  a  prob¬ 
lem  of  appropriately  reconfiguring  successive  PLSA  invocations  to  achieve  appropriate 
accuracy/run-time  trade-offs  as  optimization  progresses. 

8.3  Simulated  Heating 

From  the  discussion  of  prior  work  we  see  that  one  weakness  of  many  existing  ap¬ 
proaches  is  their  sensitivity  to  parameter  settings.  Also,  excellent  results  have  been 
achieved  through  hybrid  global/local  optimization  techniques,  but  they  have  not  been 
examined  carefully  for  a  fixed  optimization  time  budget.  In  the  context  of  a  limited  time 
budget,  we  are  especially  interested  in  minimizing  wasted  time.  One  obvious  place  to 
focus  is  at  the  beginning  of  the  optimization,  where  many  of  the  candidate  solutions 
generated  by  the  global  search  are  of  poor  quality.  Intuitively,  one  would  want  to  evalu¬ 
ate  these  initial  solutions  quickly  and  not  spend  too  much  time  on  the  local  search.  Also, 
it is desirable to reduce the number of trial runs required to find an optimal parameter
setting. One way to do this is to require only that a good range for the parameter be
given. These considerations lead to the idea of simulated heating.

8.3.1  Basic  Principles 

A general single objective optimization problem can be described as an objective function
f that maps a tuple of m parameters (decision variables) to a single objective
y. Formally, we wish to either minimize or maximize y = f(x) subject to
$x = (x_1, x_2, \ldots, x_m) \in X$, where x is called the decision vector, X is the parameter space
or search space, and y is the objective. A solution candidate consists of a particular
$(y_0, x_0)$ where $y_0 = f(x_0)$.

We  will  approach  the  optimization  problem  by  using  an  iterative  search  process. 
Given  a  set  X,  and  a  function  F,  which  maps  X  onto  itself,  we  define  an  iterative  search 
process as a sequence of successive approximations to F, starting with an $x^0$ from X,
with $x^{r+1} = F(x^r)$ for r = 0, 1, 2, .... One iteration is defined as a consecutive
determination  of  one  candidate  from  another  candidate  set  using  some  F.  For  an  evolu¬ 
tionary  algorithm,  one  iteration  consists  of  the  determination  of  one  generation  from  the 
previous  generation,  with  F  consisting  of  the  selection,  crossover,  and  mutation  rules. 

The  basic  idea  behind  simulated  heating  is  to  vary  the  local  search  parameter  p 
during  the  optimization  process.  This  is  in  contrast  to  the  more  commonly  employed 
technique of choosing a single value for p (typically that value producing highest accuracy
of the local search L(p)) and keeping it constant during the entire optimization.
Here, we start with a low value for p, which implies a low cost C(p) and accuracy A(p)
for the local search, and increase p at certain points in time during the optimization,
which increases C(p) and A(p). This is depicted in Figure 8.5, where the dotted line
corresponds to simulated heating, and the dashed line corresponds to the traditional approach.
The goal is to focus on the global search at the beginning and to find promising
regions of the search space first; for this phase, L(p) runs with low accuracy, which in
turn allows a greater number of optimization steps of the global search G. Afterward,
more time is spent by L(p) in order to improve the solutions found or to assess them
more accurately. As a consequence, fewer global search operations are possible during
this phase of optimization. Since A(p) is systematically increased during the process,
we use the term simulated heating for this approach by analogy to simulated annealing
where the ‘temperature’ is continuously decreased according to a given cooling scheme.

Figure 8.5: Simulated heating vs. traditional approach to utilizing local search.
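The basic principle can be illustrated with a small, self-contained Python sketch. Everything in it is a toy assumption rather than the thesis implementation: the budget is counted in objective evaluations instead of wall-clock time, the local search is a random-probe routine whose cost grows linearly with p, the global search is a simple truncation-selection-plus-mutation step standing in for an EA, and the budget is split equally among the parameter values.

```python
import random


def toy_local_search(x, p, f, rng):
    """Toy PLSA: score x by the best of p random probes near it.
    Costs exactly 1 + p objective evaluations; x itself is not modified."""
    best = f(x)
    for _ in range(p):
        best = min(best, f(x + rng.gauss(0.0, 0.1)))
    return best


def simulated_heating(f, params, budget, pop_size, rng):
    """Raise the local search parameter p stage by stage under a fixed budget.

    params: increasing PLSA parameter values (assumed schedule).
    budget: total number of objective evaluations allowed.
    pop_size: even population size.
    Returns (best score seen, p used in each generation, evaluations used).
    """
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    used, best, schedule = 0, float("inf"), []
    share = budget // len(params)          # equal budget share per stage
    for stage, p in enumerate(params, start=1):
        gen_cost = pop_size * (1 + p)      # evaluations per generation
        while used + gen_cost <= share * stage:
            fitness = [toy_local_search(x, p, f, rng) for x in population]
            used += gen_cost
            best = min(best, min(fitness))
            schedule.append(p)
            # Global step: keep the better half, refill by mutation.
            ranked = [x for _, x in sorted(zip(fitness, population))]
            parents = ranked[: pop_size // 2]
            population = parents + [x + rng.gauss(0.0, 1.0) for x in parents]
    return best, schedule, used


best, schedule, used = simulated_heating(lambda x: x * x, [1, 4, 16],
                                         3000, 10, random.Random(1))
print(len(schedule), used)
```

With these settings the loop runs 50 cheap generations at p = 1, then 20 at p = 4, then 5 at p = 16: many global steps early, fewer but more thoroughly refined ones late. The individuals are scored by the local search but not replaced by its result, which corresponds to the Baldwinian setting.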

8.3.2  Optimization  Scenario 

We assume that we have a global search algorithm (GSA)³ G operating on a set of
solution candidates and a PLSA L(p), where p is the parameter of the local search
procedure.⁴ Let

• $C_{fix}$ denote the maximum (worst-case) time needed by G to generate a new
solution that is inserted in the next solution candidate set,

• C(p) denote the complexity (worst-case run-time) of L for the parameter choice p,

• A(p) be the accuracy (effectiveness) of L with regard to p, and

• R denote the set of permissible values for parameter p. Typically, R may be
described by an interval $[p_{\min}, \ldots, p_{\max}] \cap \Re$ where $\Re$ denotes the set of reals and
$C(p_{\min}) < C(p_{\max})$.

³In this thesis, we focus on an evolutionary algorithm as the global search algorithm, although the
approach is general enough to hold for any global search algorithm.

Furthermore, suppose that for any pair $(p_1, p_2)$ of parameter values we have that

$$(p_1 < p_2) \Rightarrow (C(p_1) < C(p_2)) \text{ and } (A(p_1) < A(p_2)) \qquad (8.1)$$

That  is,  increasing  parameter  values  in  general  result  in  increased  consumption  of  compile¬ 
time,  as  well  as  increased  optimization  effectiveness. 

Generally,  it  is  very  difficult,  if  not  impossible,  to  analytically  determine  the  func¬ 
tions  C(p)  and  A(p),  but  these  functions  are  useful  conceptual  tools  in  discussing  the 
problem  of  designing  cooperating  GSA/PLSA  combinations.  The  techniques  that  we 
explore  in  this  thesis  do  not  require  these  functions  to  be  known.  The  only  requirement 
we make is that the monotonicity property (8.1) be obeyed at least in an approximate
sense (fluctuations about relatively small variations in parameter values are admissible,
but significant increases in the PLSA parameter value should correspond to increasing
cost and accuracy). Consequently, a tunable trade-off emerges: when A(p) is low, refinement
is generally low as well, but not much time is consumed (C(p) is also low).
Conversely, higher A(p) requires higher computational cost C(p). We define simulated
heating as follows:

⁴For simplicity it is assumed here that p is a scalar rather than a vector of parameters.

Definition 1: [Heating scheme]

A heating scheme $H$ is a triple $H = (H_R, H_{it}, H_{set})$ where:

•  $H_R$ is a vector of PLSA parameter values, $H_R = (p_1, \ldots, p_n)$, with
$p_i \in [p_{\min}, p_{\max}]$ and $p_1 < p_2 < \cdots < p_n$,

•  $H_{it}$ is a boolean function, which yields true if the number of iterations performed
for parameter $p_i$ does not exceed the maximum number of iterations allowed for
$p_i$, and

•  $H_{set}$ is a boolean function, which yields true if the size of the solution candidate
set does not exceed the maximum size for $p_i$ and iteration $t$ of the overall
GSA/PLSA hybrid.

The meanings of the functions $H_{it}$ and $H_{set}$ will become clear in the global/local
hybrid algorithm of Figure 8.6, which is taken as the basis for the optimization scenario
considered in this thesis.

The  GSA  considered  here  is  an  evolutionary  algorithm  (EA)  that  is 

1.  Generational,  i.e.,  at  each  evolution  step  an  entirely  new  population  is  created. 
This  is  in  contrast  to  a  non-generational  or  steady-state  EA  that  only  considers  a 
single  solution  candidate  per  evolution  step; 

2.  Baldwinian,  i.e.,  the  solutions  improved  by  the  PLSA  are  not  re-inserted  in  the 
population.  This  is  in  contrast  to  a  Lamarckian  EA,  in  which  solutions  would  be 
updated  after  PLSA  refinement. 
Algorithm 8.1: Global/Local Hybrid()

Input:  $H = ((p_1, \ldots, p_n), H_{it}, H_{set})$  (heating scheme)

$T_{\max}$  (maximum time budget)

Output:  $s$  (best solution found)

Step 1:  Initialization: Set $T = 0$ (time used), $t = 0$ (iterations performed), and
$i = 1$ (current PLSA parameter index).

Step 2:  Heating: Set $p = p_i$.

Step 3:  Next iteration: Create an empty multi-set of solution candidates $S_t = \emptyset$.

Step 4:  Global search: If $t = 0$, create a solution candidate $s$ at random. Otherwise,
generate a new solution candidate using G based on the previous solution
candidate set $S_{t-1}$ and the associated quality function $F_{t-1}$.

Step 5:  Local search: Apply L with parameter $p$ to $s$ and assign it a quality (fitness)
$F_t(s)$.

Step 6:  Termination for candidate set: Set $S_t = S_t + s$ and $T = T + C_{\mathrm{fix}} + C(p)$.
If the condition $H_{set}$ is fulfilled and $T < T_{\max}$ then go to Step 4.

Step 7:  Termination for iteration: Set $t = t + 1$. If the condition $H_{it}$ is fulfilled
and $T < T_{\max}$ then go to Step 3.

Step 8:  Termination for algorithm: If $i < n$ increment $i$. If $T < T_{\max}$ then go to
Step 2.

Step 9:  Output: Apply L with parameter $p_{\max}$ to the best solution in
$\bigcup_{0 \le i \le t} S_i$ regarding the corresponding quality functions $F_i$; the resulting
solution $s$ is the outcome of the algorithm.

Figure  8.6:  Global/Local  Search  Hybrid. 
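
The control flow of Figure 8.6 can be sketched in Python as follows. This is only an illustrative reading of the nine steps, not the implementation used in this thesis: the callables `random_solution`, `generate`, and `local_search`, the argument conventions of `h_it` and `h_set`, the use of accounted (rather than measured) time, and the assumption that lower fitness is better are all choices made here for the sake of a runnable example.

```python
def global_local_hybrid(params, h_it, h_set, t_max, c_fix, cost,
                        random_solution, generate, local_search):
    """Sketch of Algorithm 8.1 using accounted time T (fitness is minimized)."""
    T, t, i = 0.0, 0, 0
    archive = []            # all (candidate, fitness) pairs from every S_t
    previous = []           # S_{t-1} together with its fitness values F_{t-1}
    while True:
        p = params[i]                                   # Step 2: heating
        iters_for_p = 0
        while True:
            current = []                                # Step 3: empty S_t
            while True:
                # Step 4: random start, otherwise global search on S_{t-1}
                s = random_solution() if t == 0 else generate(previous)
                # Step 5: Baldwinian, so only the fitness is kept
                _, fitness = local_search(s, p)
                current.append((s, fitness))            # Step 6
                T += c_fix + cost(p)
                if not (h_set(i, t, len(current)) and T < t_max):
                    break
            archive.extend(current)
            previous = current
            t += 1                                      # Step 7
            iters_for_p += 1
            if not (h_it(i, iters_for_p) and T < t_max):
                break
        if i < len(params) - 1:                         # Step 8
            i += 1
        if not T < t_max:
            break
    # Step 9: refine the best stored candidate with the largest parameter
    best, _ = min(archive, key=lambda pair: pair[1])
    return local_search(best, params[-1])[0]
```

Here `local_search(s, p)` is assumed to return a pair (refined solution, fitness), and the last entry of `params` stands in for $p_{\max}$.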
8.4  Simulated  Heating  Schemes


We are interested in exploring optimization techniques in which the overall optimization
time is fixed and specified in advance (fixed time budget). During the optimization
and within this time budget, we allow a heating scheme to adjust three optimization
parameters per PLSA parameter value $p_i$:

1.  the number of GSA iterations $t_i$,

2.  the size of the solution candidate set $N_i$, and

3.  the maximum optimization time using this parameter value $T_i$.

We distinguish between static and dynamic heating based on how many of the
parameters are fixed and how many are allowed to vary during the optimization. This is
illustrated in Figure 8.7. In our experiments, we keep the size of the solution candidate
set (GA population) fixed, and thus only consider the FIS, FTS, and VIT strategies. For the
sake of completeness, however, we outline all these strategies below.

8.4.1  Static  Heating 

Static heating means that at least two of the above three parameters are fixed and
identical for all PLSA parameter values considered during the optimization process. As a
consequence, the third parameter is either given as well or can be calculated before
run-time for each PLSA parameter value separately. As illustrated on the left of Figure 8.7,
there are four possible static heating schemes.
Figure 8.7: Illustration of the different types of i) static heating and ii) dynamic heating.
For static heating, at least two of the three attributes are fixed. (FIS refers to fixed
iterations and population size per parameter; FTS refers to fixed time and population
size per parameter; FIT refers to fixed iterations and fixed time per parameter.) For
dynamic heating, at least two attributes are variable. (VIT refers to variable iterations
and time per parameter; VIS refers to variable iterations and population size; VTS
refers to variable time and population size.) In our experiments, we will only consider
the FIS, FTS, and VIT strategies.
PLSA Parameter Fixed: Standard Hybrid Approach


Fixing all three parameters is identical to keeping p constant. Thus, only a single PLSA
parameter value is used during the optimization process. This scheme represents the
common way to incorporate PLSAs into GSAs and is taken as the reference for the
other schemes, since no heating is actually performed.


Number of Iterations and Size of Solution Candidate Set Fixed Per PLSA Parameter (FIS)

In this strategy (FIS), the parameter $p_i$ is constant for exactly $t_i = t_p$ iterations. The
question is, therefore, how many iterations $t_p$ may be performed per parameter within
the time budget $T_{\max}$. From the constraint

$$T_{\max} \ge t_p N (C_{\mathrm{fix}} + C(p_1)) + t_p N (C_{\mathrm{fix}} + C(p_2)) + \cdots + t_p N (C_{\mathrm{fix}} + C(p_n))$$

we obtain

$$t_p = \left\lfloor \frac{T_{\max}}{N \sum_{i=1}^{n} (C_{\mathrm{fix}} + C(p_i))} \right\rfloor \qquad (8.2)$$

as the number of iterations assigned to each $p_i$.



Amount of Time and Size of Solution Candidate Set Fixed Per PLSA Parameter (FTS)

For the FTS strategy, the points in time where p is increased are equidistant and may
be computed simply as follows. The time budget, when split equally between
n parameters, becomes $T_p = T_{\max}/n$ per parameter. Hence, the number of iterations $t_i$
that may be performed using parameter $p_i$, $i = 1, \ldots, n$, is restricted by

$$t_i N (C_{\mathrm{fix}} + C(p_i)) \le T_p, \quad \forall i = 1, \ldots, n.$$

Thus, we obtain

$$t_i = \left\lfloor \frac{T_{\max}}{n N (C_{\mathrm{fix}} + C(p_i))} \right\rfloor \qquad (8.3)$$

as the maximum number of iterations that may be computed using parameter $p_i$ in order
to stay within the given time budget.


Number of Iterations and Amount of Time Fixed Per PLSA Parameter (FIT)

With the FIT scheme the size of the solution candidate set is different for each PLSA
parameter considered. The time per iteration for parameter $p_i$ is given by $T_i = T_{\max}/t_{\max}$
and is the same for all $p_i$ with $1 \le i \le n$. This relation together with the constraint

$$T_i \ge N_i (C_{\mathrm{fix}} + C(p_i))$$

yields

$$N_i = \left\lfloor \frac{T_{\max}}{t_{\max} (C_{\mathrm{fix}} + C(p_i))} \right\rfloor \qquad (8.4)$$

as the maximum size of the solution candidate set for $p_i$.

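
As a quick numerical check of Equations 8.2 to 8.4, the three static budgets can be computed directly. The function names and the sample cost values below are hypothetical and chosen only for illustration; in practice C(p) is usually not known analytically and would have to be estimated.

```python
import math

def fis_iterations(t_max, n_pop, c_fix, costs):
    """Equation 8.2: one common iteration count t_p for every parameter."""
    return math.floor(t_max / (n_pop * sum(c_fix + c for c in costs)))

def fts_iterations(t_max, n_pop, c_fix, costs):
    """Equation 8.3: iteration count t_i per parameter, equal time slices."""
    n = len(costs)
    return [math.floor(t_max / (n * n_pop * (c_fix + c))) for c in costs]

def fit_set_sizes(t_max, total_iters, c_fix, costs):
    """Equation 8.4: candidate set size N_i per parameter."""
    return [math.floor(t_max / (total_iters * (c_fix + c))) for c in costs]

# Hypothetical example: three parameter values with C(p_i) = 1, 3, 9
costs = [1, 3, 9]
print(fis_iterations(1000, 10, 1, costs))   # t_p
print(fts_iterations(1000, 10, 1, costs))   # t_1, t_2, t_3
print(fit_set_sizes(1000, 30, 1, costs))    # N_1, N_2, N_3
```

With these sample values, FIS gives the same iteration count to every parameter, whereas FTS and FIT assign fewer iterations or smaller candidate sets to the more expensive parameter values.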


8.4.2  Dynamic  Heating 

In contrast to static heating, dynamic heating refers to the case in which at least two of
the three optimization parameters are not fixed and may vary for different PLSA
parameters. The four potential types of dynamic heating are shown in Figure 8.7. However,
the scenario where all three optimization parameters are variable and may be different
for each PLSA parameter is more hypothetical than realistic. This approach is not
investigated in this thesis and is listed only for completeness. Hence, we consider
three dynamic heating schemes where only one parameter is fixed. One of the
variable parameters is determined dynamically during run-time according to a predefined
criterion. Here, the criterion is whether an improvement with regard to the solutions
generated can be observed during a certain time interval (measured in seconds,
number of solutions generated, or number of iterations performed). The time constraint is
defined in terms of the remaining variable parameter.

Number of Iterations and Size of Solution Candidate Set Variable Per PLSA Parameter (VIS)

With the VIS strategy, the time $T_i = T_{\max}/n$ per PLSA parameter value is fixed (and
identical for all $p_i$). If the time constraint is defined on the basis of the number of
solutions generated, the hybrid works as follows: As long as the time $T_i$ is not exceeded,
new solutions are generated using $p_i$ and copied to the next solution candidate set;
otherwise, the next GSA iteration with $p_{i+1}$ is performed. If, however, the time elapsed
for the current iteration is less than $T_i$ and none of the recently generated $N_{stag}$ solutions
achieves an improvement in fitness, the next iteration with $p_i$ is started.

It is not practical to consider a certain number of iterations as the time constraint:
since the time per iteration is not known, there is no condition that determines when the
filling of the next solution candidate set can be stopped.

Amount of Time and Size of Solution Candidate Set Variable Per PLSA Parameter (VTS)

There are two heating schemes possible when the number of iterations $t_i$ per PLSA
parameter is a constant value $t_i = t_{\max}/n$. One scheme we call VTS-S, in which the next
solution candidate set is filled with new solution candidates until, for $N_{stag}$ solutions, no
improvement in fitness is observed. In this case the same procedure is applied to the
next iteration using the same parameter $p_i$. If $t_i$ iterations have been performed for $p_i$,
the next PLSA parameter $p_{i+1}$ is taken.

In the other heating scheme, which we call VTS-T, the filling of the next solution
candidate set is stopped if, for $T_{stag}$ seconds, the quality of the best solution in the
solution candidate set has stagnated (i.e., has not improved).

Number  of  Iterations  and  Amount  of  Time  Variable  Per  PLSA  Parameter  (VIT) 

Here again there are two possible variations. The first, called VIT-I, considers the
number of iterations as the time constraint. The next PLSA parameter value is taken when
for a number $t_{stag}$ of iterations the quality of the best solution in the solution candidate
set has not improved. As a consequence, for each parameter a different amount of time
may be considered until the stagnation condition is fulfilled.

The alternative VIT-T is to define the time constraint in seconds. In this case, the
next PLSA parameter value is taken when, for $T_{stag}$ seconds, no improvement in fitness
was achieved. As a consequence, for each parameter a different number of iterations
may be considered until the stagnation condition is fulfilled.
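
The VIT-I stagnation test can be sketched as a small counter that is updated once per GSA iteration. The class name `StagnationCounter`, its interface, and the assumption that lower fitness is better are illustrative choices made here, not part of the schemes as defined above.

```python
class StagnationCounter:
    """Signals when the best fitness has not improved for t_stag iterations."""

    def __init__(self, t_stag):
        self.t_stag = t_stag
        self.best = None        # best (lowest) fitness seen so far
        self.stalled = 0        # consecutive iterations without improvement

    def update(self, best_fitness_this_iteration):
        """Return True when the next PLSA parameter value should be taken."""
        if self.best is None or best_fitness_this_iteration < self.best:
            self.best = best_fitness_this_iteration
            self.stalled = 0
            return False
        self.stalled += 1
        if self.stalled >= self.t_stag:
            self.stalled = 0    # start counting afresh for the next parameter
            return True
        return False
```

A VIT-T variant would follow the same pattern with elapsed seconds in place of the iteration count.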

In  the  next  chapter  we  will  describe  some  experiments  to  verify  the  simulated  heating 
technique. 
Chapter  9


Simulated  Heating  Experiments 


Hybrid global/local search techniques are most effective in problems with complicated
search spaces, and problems for which local search techniques have been developed that
make maximum use of problem-specific information. We investigate the effectiveness of
the simulated heating approach on the voltage scaling problem for embedded
multiprocessors described in Section [Q|, as well as a memory compaction problem in embedded
systems. These problems are very different in structure, but both have vast and
complicated solution spaces. In addition, the parameterized local search algorithms (PLSA)
for these applications exhibit a wide range of accuracy/complexity trade-offs. To
further illustrate the utility of simulated heating, we demonstrate its use on the well-known
binary knapsack problem.

9.1  Simulated  Heating  for  Voltage  Scaling 

The problem of dynamic voltage scaling for multiprocessors was introduced in
Section fO] and two different PLSAs for the problem were presented in Section 8.1.1. In
this section we explain how we used simulated heating to solve this problem.
Experimental results are given in Section

9.1.1  Voltage  Scaling  Problem  Statement


We assume that a schedule has been computed beforehand so that the ordering of the
tasks on the processors is known. The optimization problem we address consists of
finding the voltage vector $V = (v_1, v_2, \ldots, v_n)$ for the n tasks in the application graph, such
that the energy per computation period (average power) is minimized and the
throughput satisfies some pre-specified constraint (e.g., as determined by the sample period in
a DSP application). For each task, as its voltage is decreased, its energy is decreased
and its execution time is increased, as described in [0]. The computation period is
determined from the period graph. A simple example is shown in Figure 9.1. Here we can
see that by decreasing the voltage on task B, the average power is reduced while the
execution period is unchanged. There is a potentially vast search space for many practical
applications. For example, if we consider discrete voltage steps of 0.1 Volts over a range
of 5 Volts, each task can take one of 50 voltage levels, so there are $50^n$ possible voltage
vectors V from which to search. The number of tasks n in an application may be in the
hundreds.

9.1.2  GSA:  Evolutionary  Algorithm  for  Voltage  Scaling 

Each solution s is encoded by a vector of positive real numbers of size N representing
the voltage assigned to each of the N tasks in the application. The one-point crossover
operator randomly selects a crossover point within a vector and then interchanges the two
parent vectors at this point to produce two new offspring. The mutation operator
randomly changes one of the elements of the vector to a new (positive) value. At each
generation of the EA an entirely new population is created based on the crossover and
mutation operators. The crossover probability was 0.9, the mutation probability was 0.1,
and the population size was 50.
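
The two variation operators described above can be sketched as follows. The function names and the admissible voltage range (`v_min`, `v_max`) are assumptions made for this illustration; the text only requires that mutated values be positive.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length voltage vectors at a random point."""
    point = rng.randrange(1, len(parent_a))      # requires at least two tasks
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(voltages, rng, v_min=0.1, v_max=5.0):
    """Replace one randomly chosen element with a new positive voltage."""
    child = list(voltages)
    child[rng.randrange(len(child))] = rng.uniform(v_min, v_max)
    return child

rng = random.Random(0)
a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
print(one_point_crossover(a, b, rng))
print(mutate(a, rng))
```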
Figure 9.1: (a) Period graph before voltage scaling. The numbers represent execution
times (t) and energies (e) of the tasks. The execution period is determined by the longest
cycle, A → D → C, whose sum of execution times is 4 units. The energy of each task
is 4 units, so the average power is 4 units (16 total energy divided by period of 4).

(b) After voltage scaling. The voltage on task B has been reduced, increasing its
execution time from 1 unit to 2 units and decreasing its energy consumption from 4 units to 2
units. The overall execution period is still 4 units since both cycles A → D → C and
A → B → C now have an execution time of 4. The average power is 3.5 units (14 total
energy divided by period of 4).
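
The arithmetic of Figure 9.1 can be reproduced with a few lines of Python. Only the execution time of task B, the task energies, and the cycle sums are given in the caption; the individual execution times assumed below for A, C, and D are hypothetical values chosen to be consistent with those sums.

```python
def average_power(times, energies, cycles):
    """Total energy per period, where the period is the longest cycle time."""
    period = max(sum(times[task] for task in cycle) for cycle in cycles)
    return sum(energies.values()) / period

cycles = [("A", "B", "C"), ("A", "D", "C")]

# (a) before scaling: assumed times A=1, C=1, D=2; B=1 as stated in the caption
times = {"A": 1, "B": 1, "C": 1, "D": 2}
energies = {"A": 4, "B": 4, "C": 4, "D": 4}
print(average_power(times, energies, cycles))    # 16 / 4

# (b) after scaling task B: time 1 -> 2, energy 4 -> 2
times_scaled = dict(times, B=2)
energies_scaled = dict(energies, B=2)
print(average_power(times_scaled, energies_scaled, cycles))   # 14 / 4
```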
9.2  Simulated  Heating  for  Memory  Cost  Minimization


In order to further demonstrate the simulated heating technique, we apply it to another
optimization problem in electronic design automation that has a complicated search space.
This section will explain both the PLSA and the GSA for this problem. Experimental
results are given in Section fTT].

9.2.1  Background 

Digital signal processing (DSP) applications can be specified as dataflow graphs [QZZD.
As explained in Chapter 2, in dataflow a computational specification is represented as a
directed graph in which vertices (actors) specify computational functions of arbitrary
complexity, and edges specify FIFO communication between functions. A schedule for
a dataflow graph is simply a specification of the order in which the functions should
execute. A given DSP application can be accomplished with a variety of different
schedules; we would like to find a schedule which minimizes the memory requirement.
A periodic schedule for a dataflow graph is a schedule that invokes each actor at least
once and produces no net change in the number of data items queued on each edge. A
software synthesis tool generates application programs from a given schedule by piecing
together (inlining) code modules from a predefined library of software building blocks
associated with each actor. The sequence of code modules and subroutine calls that is
generated from a dataflow graph is processed by a buffer management phase that inserts
the necessary target program statements to route data appropriately between actors.

The scheduling phase has a large impact on the memory requirement of the final
implementations, and it is this memory requirement we wish to minimize in our
optimization. The key components of this memory requirement are the code size cost (the
sum of the code sizes of all inlined modules and of all software looping constructs) and
the buffering cost (the memory required for all inter-actor data transfers). Even
for a simple dataflow graph, the underlying range of trade-offs may be very complex. We
denote a schedule loop with the notation $(n\,T_1 T_2 \ldots T_m)$, which specifies the successive
repetition n times of a subschedule $T_1 T_2 \ldots T_m$, where the $T_i$ are actors. A schedule that
contains zero or more schedule loops is called a looped schedule, and a schedule that
contains exactly zero schedule loops is called a flat schedule (thus, a flat schedule is a
looped schedule, but not vice versa).

Consider two schedules $S_1 = (8YZ)X(2YZ)$ and $S_2 = X(10YZ)$, which repeat
the actors X, Y, and Z the same number of times (1, 10, and 10, respectively). The code
sizes for schedules $S_1$ and $S_2$ can be expressed, respectively, as
$\kappa(X) + 2\kappa(Y) + 2\kappa(Z) + 2L_c$ and $\kappa(X) + \kappa(Y) + \kappa(Z) + L_c$,
where $L_c$ denotes the processor-dependent code size overhead of a software looping
construct, and $\kappa(A)$ denotes the program memory cost of the library code module for
an actor A. The code size of schedule $S_1$ is larger because it contains more "actor
appearances" than schedule $S_2$ (e.g., actor Y appears twice in $S_1$ vs. only once in
$S_2$), and $S_1$ also contains more schedule loops (2 vs. 1). The buffering cost of a schedule
is computed as the sum over all edges e of the maximum number of buffered (produced,
but not yet consumed) tokens that coexist on e throughout execution of the schedule.
Thus, the buffering costs of $S_1$ and $S_2$ are 11 and 19, respectively. The memory cost of a
schedule is the sum of its code size and buffering costs. Thus, depending on the relative
magnitudes of $\kappa(X)$, $\kappa(Y)$, $\kappa(Z)$, and $L_c$, either $S_1$ or $S_2$ may have lower memory cost.
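
The code size comparison above can be made concrete with a small sketch. The nested-list encoding of looped schedules, the function names, and the numeric module and loop costs are hypothetical; the two-loop schedule is written here with X between the loops, an assumed placement that does not affect its code size.

```python
def code_size(schedule, kappa, loop_cost):
    """Sum of module sizes over actor appearances plus L_c per schedule loop."""
    total = 0
    for item in schedule:
        if isinstance(item, str):            # a single actor appearance
            total += kappa[item]
        else:                                # a loop: (count, body)
            _count, body = item
            total += loop_cost + code_size(body, kappa, loop_cost)
    return total

def firings(schedule, factor=1, counts=None):
    """Number of times each actor is invoked by the looped schedule."""
    counts = {} if counts is None else counts
    for item in schedule:
        if isinstance(item, str):
            counts[item] = counts.get(item, 0) + factor
        else:
            count, body = item
            firings(body, factor * count, counts)
    return counts

s1 = [(8, ["Y", "Z"]), "X", (2, ["Y", "Z"])]     # two loops, Y and Z appear twice
s2 = ["X", (10, ["Y", "Z"])]                     # one loop
kappa = {"X": 10, "Y": 5, "Z": 5}                # hypothetical module sizes
print(firings(s1) == firings(s2))                # same repetition counts
print(code_size(s1, kappa, 2), code_size(s2, kappa, 2))
```

Both schedules invoke the actors equally often, but the schedule with two loops pays for the duplicated appearances of Y and Z and for the second looping construct.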

9.2.2  MCMP  Problem  Statement 

The memory cost minimization problem (MCMP) is the problem of computing a looped
schedule that minimizes the memory cost for a given dataflow graph, and a given set of
actor and loop code sizes. It has been shown that this problem is NP-complete [1771]. A
tractable algorithm called CDPPO (code size dynamic programming post optimization),
which can be used as a local search for MCMP, has also been described Drama 03].
That work applied CDPPO uniformly at "full strength" (maximum
accuracy/maximum run-time), as is conventional with local search techniques, and did
not explore its PLSA form. As explained below, the CDPPO algorithm
can be formulated naturally as a PLSA with a single parameter such that accuracy and
run-time both increase monotonically with the parameter value.

9.2.3  Implementation  Details  for  MCMP 

To  solve  the  MCMP  we  use  a  GSA/PLSA  hybrid  where  an  evolutionary  algorithm  is  the 
GSA  and  CDPPO  is  the  PLSA.  The  evolutionary  algorithm  and  parameterized  CDPPO 
are  explained  below. 

9.2.4  GSA:  Evolutionary  Algorithm  for  MCMP 

Each solution s is encoded by an integer vector, which represents the corresponding
schedule, i.e., the order of actor executions (firings). The decoding process that takes
place in the local search/evaluation phase (Step 5 in Figure 8.6) is as follows:

•  First  a  repair  procedure  is  invoked,  which  transforms  the  encoded  actor  firing 
sequence  into  a  valid  flat  schedule. 

•  Next  the  parameterized  CDPPO  is  applied  to  the  resulting  flat  schedule  in  order 
to  compute  a  (sub)optimal  looping,  and  afterward  the  data  requirement  (buffering 
cost)  D(s)  and  the  program  requirement  (code  size  cost)  P(s)  of  the  software 
implementation  represented  by  the  looped  schedule  are  calculated  based  on  a 
certain  processor  model. 
Finally, both D(s) and P(s) are normalized (the minimum values $D_{\min}$ and $P_{\min}$ and
maximum values $D_{\max}$ and $P_{\max}$ for the distinct objectives can be determined
beforehand) and a fitness is assigned to the solution s according to the following formula:

$$F(s) = 0.5\,\frac{D(s) - D_{\min}}{D_{\max} - D_{\min}} + 0.5\,\frac{P(s) - P_{\min}}{P_{\max} - P_{\min}} \qquad (9.1)$$



Note  that  the  fitness  values  are  to  be  minimized  here. 
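
A direct transcription of the fitness assignment, read here as an equally weighted sum of the two normalized costs, might look as follows. The function name and the sample bounds are hypothetical.

```python
def mcmp_fitness(d, p, d_min, d_max, p_min, p_max):
    """Equally weighted, normalized buffering and code size cost (minimized)."""
    data_term = (d - d_min) / (d_max - d_min)
    program_term = (p - p_min) / (p_max - p_min)
    return 0.5 * data_term + 0.5 * program_term

# Hypothetical bounds: buffering cost in [10, 50], code size in [100, 300]
print(mcmp_fitness(10, 100, 10, 50, 100, 300))   # best case
print(mcmp_fitness(50, 300, 10, 50, 100, 300))   # worst case
print(mcmp_fitness(30, 150, 10, 50, 100, 300))   # intermediate schedule
```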


9.2.5  PLSA:  Parameterized  CDPPO  for  MCMP 

The “unparameterized” CDPPO algorithm was first proposed in [11 hi]. CDPPO computes
an optimal parenthesization in a bottom-up fashion, which is analogous to dynamic
programming techniques for matrix-chain multiplication [GEO-. Given a dataflow graph
$G = (V, E)$ and an actor invocation sequence (flat sequence) $f_1, f_2, \ldots, f_n$, where each
$f_i \in V$, CDPPO first examines all 2-invocation sub-chains
$(f_1, f_2), (f_2, f_3), \ldots, (f_{n-1}, f_n)$ to determine an optimally compact looping structure
(subschedule) for each of these sub-chains. For a 2-invocation sub-chain $(f_i, f_{i+1})$, the
most compact subschedule is easily determined: if $f_i = f_{i+1}$, then $(2 f_i)$ is the most
compact subschedule, otherwise the original (unmodified) subschedule $f_i f_{i+1}$ is the most
compact. After the optimal 2-node subschedules are computed in this manner, these
subschedules are used to determine optimal 3-node subschedules (optimal looping
structures for subschedules of the form $f_i, f_{i+1}, f_{i+2}$); and the 2- and 3-node subschedules
are then used to determine optimal 4-node subschedules, and so on until the n-node
optimal subschedule is computed, which gives a minimum code size implementation of the
input invocation sequence $f_1, f_2, \ldots, f_n$.

Due to its high complexity, CDPPO can require significant computational resources
for a single application; e.g., we have commonly observed run-times on the order of
30-40 seconds for practical applications. In the context of global search techniques,
such performance can greatly limit the number of neighborhoods (flat schedules) in
the search space that are sampled. To address this limitation, however, a simple and
effective parameterization emerges: we simply set a threshold M on the maximum
sub-chain (subschedule) size to which optimization is attempted. This threshold becomes
the parameter of the resulting parameterized CDPPO (PCDPPO) algorithm.

In summary, PCDPPO is a parameterized adaptation of CDPPO for addressing the
schedule looping problem. The run-time and accuracy of PCDPPO are both monotonically
nondecreasing functions of the algorithm “threshold” parameter M. In the context
of the memory minimization problem, PCDPPO is a genuine PLSA.
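
The effect of the threshold M can be illustrated with a deliberately simplified stand-in for PCDPPO. The sketch below is not the CDPPO algorithm of the cited work: it considers code size only, recognizes a loop only when a sub-chain is an exact repetition of a shorter body, and splits the flat sequence into consecutive pieces of at most M invocations, each of which is looped optimally. All names and cost values are hypothetical.

```python
from functools import lru_cache

def looped_code_size(seq, kappa, loop_cost, M):
    """Smallest code size reachable when loops span at most M invocations."""
    n = len(seq)

    @lru_cache(maxsize=None)
    def chain(i, j):
        """Minimum code size of seq[i:j] over all loopings of this sub-chain."""
        length = j - i
        if length == 1:
            return kappa[seq[i]]
        # Option 1: concatenate two optimally looped shorter sub-chains.
        best = min(chain(i, k) + chain(k, j) for k in range(i + 1, j))
        # Option 2: the sub-chain repeats a body of length d; wrap it in a loop.
        for d in range(1, length // 2 + 1):
            if length % d == 0 and all(
                    seq[i + t] == seq[i + t % d] for t in range(length)):
                best = min(best, chain(i, i + d) + loop_cost)
        return best

    # Partition the flat sequence into consecutive pieces of length <= M.
    total = [0] + [float("inf")] * n
    for j in range(1, n + 1):
        for i in range(max(0, j - M), j):
            total[j] = min(total[j], total[i] + chain(i, j))
    return total[n]

flat = ["X"] + ["Y", "Z"] * 10
kappa = {"X": 10, "Y": 5, "Z": 5}
for M in (1, 4, 20):
    print(M, looped_code_size(flat, kappa, 2, M))
```

In this toy setting a larger M can only add looping options, so the resulting code size never increases with M, while the amount of dynamic programming work grows. This mirrors, in a simplified way, the monotonicity required of a PLSA.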


9.3  Simulated  Heating  for  Binary  Knapsack  Problem 


In order to further illuminate simulated heating, we begin by demonstrating the
technique on a widely known problem, namely the binary (0-1) knapsack problem (KP).
This problem has been studied extensively, and good exact solution methods for it have
been developed (e.g., see [EBD). The exact solutions are based on either
branch-and-bound or dynamic programming techniques. In this problem, we are given a set of n
items, each with profit $A_j$ and weight $w_j$, which must be packed in a knapsack with
weight capacity c. The problem consists of selecting a subset of the n items whose total
weight does not exceed c and whose total profit is a maximum. This can be expressed
formally as:


$$\text{maximize} \quad z = \sum_{j=1}^{n} A_j x_j \qquad (9.2)$$

$$\text{subject to} \quad \sum_{j=1}^{n} w_j x_j \le c \qquad (9.3)$$

$$x_j \in \{0, 1\}, \quad j \in \{1, \ldots, n\} \qquad (9.4)$$

where $x_j = 1$ if item j is selected, and $x_j = 0$ otherwise.

Balas and Zemel [4’J first introduced the “core problem” as an efficient way of
solving KP, and most of the exact algorithms have been based on this idea. Pisinger [K7I]
has modeled the hardness of the core problem and noted that it is important to test at a
variety of weight capacities. He proposed a series of randomly generated test instances
for KP. In our experiments we generate test instances using the test generator function
described in appendix B of [S871J. We compare our results to the exact solution described
in [J8EJ], for which the C code can be found in [QZLOJ.

9.3.1  Implementation 

To solve the KP we use a GSA/PLSA hybrid as discussed in Section |0] where an
evolutionary algorithm is the global search algorithm (GSA) and a simple pairwise exchange
is the parameterized local search algorithm (PLSA). The evolutionary algorithm and
local search are explained below:

GSA:  Evolutionary  Algorithm 

Each candidate solution s is encoded as a binary vector x, where $x_j$ are the binary
decision variables from equation [TT] above. The weight of a given solution candidate s
is $w_s = \sum_{j=1}^{n} x_j w_j$ and the profit of s is $A_s = \sum_{j=1}^{n} x_j A_j$. The sum of the profits of
all items is defined as $A_t = \sum_{j=1}^{n} A_j$. We define a fitness function which we would like
to minimize:


(9.5) 


Thus we penalize solution candidates whose weight exceeds the capacity, and seek
to maximize the profit. The $A_t$ term was added so that F(s) is never negative. For
the KP experiments we used a standard simple genetic algorithm described in [130
with one-point crossover, crossover probability 0.9, non-overlapping populations of size
popsize = 100, and elitism.

Parameterized  Local  Search  for  Knapsack  Problem 

At the beginning of the optimization algorithm, the items are sorted by increasing
profit, so that $A_i \le A_j$ for all $i < j$. Given an input solution candidate s, the local
search first computes its weight $w_s$. If $w_s > c$, items are removed ($x_i$ set to zero)
starting at $i = 0$ until $w_s \le c$. For local search parameter p = 1, this is the only
operation performed. For p > 1, pair swap operations are also performed as explained
in Figure 9.2, where we attempt to replace an item from the solution candidate with a
more profitable item not included in the solution candidate. The number of such pair
swap operations is p. Thus the local search algorithm requires more computation time
and searches the local area more thoroughly for higher p. These are the monotonicity
requirements expressed in Equation 8.1. We define parameter p = 0 as no local search,
i.e., the optimization is an evolutionary algorithm only, and no local search is performed.
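
The repair step described above (removing the least profitable items first until the capacity constraint holds) can be sketched as follows. The function name and the example instance are hypothetical, and items are indexed from 0 in order of increasing profit.

```python
def repair(x, weights, capacity):
    """Drop selected items in order of increasing profit until weight <= capacity."""
    x = list(x)
    weight = sum(w for w, selected in zip(weights, x) if selected)
    i = 0
    while weight > capacity and i < len(x):
        if x[i]:
            x[i] = 0
            weight -= weights[i]
        i += 1
    return x

# Items already sorted by increasing profit; all four selected, capacity 9
print(repair([1, 1, 1, 1], [2, 3, 4, 5], 9))
```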

9.4  Experiments 

In this section we present experiments designed to examine several aspects of
simulated heating for the two embedded systems applications. We would like to know how
simulated  heating  compares  to  the  standard  hybrid  technique  of  using  a  fixed  parameter 
(fixed  p).  We  summarize  the  fixed  p  results  for  all  problems  for  different  values  of  p. 
We  examine  how  the  optimal  value  of  p  for  the  standard  hybrid  method  depends  on  the 
application. 
Algorithm 9.1: Pair Swap Local Search(s_in, F, s_out)

input:  solution candidate s_in of size n
input:  fitness function F
output: new solution candidate s_out

i ← n
s_out ← s_in
bestScore ← F(s_in)
count ← 0
while (i > 0) ∧ (count < pn)
  do  j ← −1
      i ← i − 1
      while (j < i) ∧ (count < pn)
        do  j ← j + 1
            if s_out[i] ≠ s_out[j]
              then  temp ← s_out[i]
                    s_out[i] ← s_out[j]
                    s_out[j] ← temp
                    score ← F(s_out)
                    if score < bestScore
                      then  bestScore ← score
                      else  temp ← s_out[i]
                            s_out[i] ← s_out[j]
                            s_out[j] ← temp


Figure  9.2:  Pseudo-code  for  pair  swap  local  search  for  binary  knapsack  problem. 
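
A runnable reading of the pseudo-code in Figure 9.2 is sketched below. The figure as reproduced here does not show where `count` is incremented; this sketch assumes that every attempted swap is counted, which is a guess rather than something stated in the text. The example fitness function is likewise a hypothetical penalty function, used only to exercise the loop.

```python
def pair_swap_local_search(s_in, fitness, p):
    """Try swapping unequal entries; keep a swap only if fitness decreases."""
    n = len(s_in)
    s_out = list(s_in)
    best_score = fitness(s_out)
    count = 0                 # assumption: counts attempted swaps
    budget = p * n
    i = n
    while i > 0 and count < budget:
        i -= 1
        j = -1
        while j < i and count < budget:
            j += 1
            if s_out[i] != s_out[j]:
                s_out[i], s_out[j] = s_out[j], s_out[i]
                count += 1
                score = fitness(s_out)
                if score < best_score:
                    best_score = score
                else:                              # undo a non-improving swap
                    s_out[i], s_out[j] = s_out[j], s_out[i]
    return s_out, best_score

# Hypothetical instance: unit weights, capacity 2, profits sorted increasingly
profits, weights, capacity = [1, 2, 3, 4], [1, 1, 1, 1], 2

def toy_fitness(x):
    weight = sum(w for w, xi in zip(weights, x) if xi)
    profit = sum(a for a, xi in zip(profits, x) if xi)
    over = max(0, weight - capacity)
    return sum(profits) - profit + 10 * over      # lower is better

print(pair_swap_local_search([1, 1, 0, 0], toy_fitness, p=2))
```

Starting from the two least profitable items, the swaps move the selection toward the two most profitable ones without ever exceeding the capacity.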
Next we compare both the static and dynamic heating schemes to the standard approach,
and to each other. For the static heating experiments, we utilize the FIS and
FTS  strategies.  Recall  that  FIS  refers  to  fixed  number  of  iterations  and  population  size 
per  parameter,  and  FTS  refers  to  fixed  time  and  population  size  per  parameter.  For  the 
dynamic  heating  experiments,  we  utilize  the  two  variants  of  the  VIT  strategy  (variable 
iterations  and  time  per  parameter).  We  also  examine  the  role  of  parameter  range  and 
population  size  on  the  optimization  results. 

9.4.1  PLSA  Run-Time  and  Accuracy  for  Voltage  Scaling  and  MCMP 

Recall  that  there  is  a  trade-off  between  accuracy  and  run-time  for  the  PLSA.  Lower 
values  of  local  search  parameter  p  mean  the  local  search  executes  faster,  but  is  not  as 
accurate. Figure 9.3 shows how the run-time of the PLSA varies with p for the two
applications. It can be seen that the monotonicity property, Equation 8.1, is satisfied for
the PLSAs.

9.4.2  Standard  Hybrid  Approach  for  Voltage  Scaling  and  MCMP 

The  standard  approach  to  hybrid  global/local  searches  is  to  run  the  local  search  at  a  fixed 
parameter.  We  present  results  for  this  method  below.  It  is  important  to  note  that,  for  a 
fixed  optimization  run-time,  the  optimal  value  of  local  search  parameter  p  can  depend  on 
the run-time and data input and cannot be predicted in advance. Figure 9.4 shows results
for the MCMP optimization using fixed values of p (standard approach, no heating), for
11 different initial populations, for population sizes N = 100 and N = 200. The y-axis
on these graphs corresponds to the memory cost of the optimized schedule so that
lower values are better. The x-axis corresponds to the fixed p value. For each value
of  p,  the  hybrid  search  was  run  for  a  time  budget  of  5  hours  with  a  fixed  value  of  p. 
[Figure 9.3 plots. Panel (a), MCMP application: "Run Times vs. PLSA Parameter p". Panel (b), voltage scaling application: "Run Times for Voltage Scaling on FFT3".]

Figure 9.3: Local search run times vs. p for MCMP application (a) and voltage scaling
application (b).
[Figure 9.4 plots: (a) N = 100, (b) N = 200.]

Figure 9.4: Standard hybrid approach to MCMP application using fixed PLSA parameter
p. Hybrid was run for 5 hours at each value of p. Population size for GA was
N = 100 in 9.4(a) and N = 200 in 9.4(b). Median, lower quartile, and upper quartile of
11 different runs shown in the three curves for each p. (Lower memory cost is better).
The same set of initial populations was used. From these graphs, it can be seen that the
local search performs best for values of p around 39. Figure 9.5 shows the number of
iterations  (generations  in  the  GSA)  performed  for  each  value  of  p.  As  p  increases,  fewer 
generations  can  be  completed  in  the  fixed  optimization  run  time. 

Figure 9.6 shows results for the voltage scaling application on 6 different input
dataflow  graphs,  for  fixed  values  of  p  (no  heating),  for  11  different  initial  populations, 
using  both  hill  climb  and  Monte  Carlo  local  search  methods.  For  each  value  of  p,  the 
hybrid  search  was  run  for  a  time  budget  of  20  minutes  with  a  fixed  value  of  p.  The 
y-axis  on  the  graph  corresponds  to  the  ratio  of  the  optimized  average  power  to  the  initial 
power,  so  that  lower  values  are  better.  For  each  p,  the  same  set  of  initial  populations  was 
used.  From  these  graphs,  it  can  be  seen  that  the  best  value  of  p  may  also  depend  on  the 
specific  problem  instance. 

9.4.3  Static  Heating  Schemes  for  Voltage  Scaling  and  MCMP 

For the MCMP application, the run-time limit for the hybrid was set to Tmax = 5 hours.
Two sets of PLSA parameters were used, R1 = [1, 153, 305, 457, 612] and
R2 = [1, 39, 77, 116, 153]. The value p = 612 corresponds to the total number of actor
invocations in the schedule for the MCMP application and is thus the maximum (highest
accuracy) possible. The parameter set R2 was chosen so that it is centered around the
best fixed p values. Figure 9.7 summarizes the results for the MCMP application with
GSA population size N = 100. In Figure 9.7, eleven runs were performed for each
heating scheme and for each parameter set. The box plot¹ in Figure 9.7(a) corresponds

¹The 'box' in the box plot stretches from the 25th percentile ('lower hinge') to the 75th percentile
('upper hinge'). The median is shown as a line across the box. The 'whisker' lines are drawn at the 10th
and 90th percentiles. Outliers are shown with a '+' character.
[Figure 9.5 plot title: "Number of generations (iterations) performed for each p".]

Figure 9.5: Standard hybrid approach (fixed p, no heating), MCMP application, using a
fixed run time. Number of generations completed is shown for hybrids utilizing different
values of p. Fewer generations are completed for higher p.
[Figure 9.6 plots: (a) Monte Carlo local search, (b) Hill climb local search.]

Figure 9.6: Standard hybrid approach using fixed PLSA parameters, voltage scaling
application, with Monte Carlo local search in 9.6(a) and hill climb local search in 9.6(b).
Hybrid was run for 20 minutes at each value of p. Median of 11 runs for each p. Lower
values of power are better. We see that the optimal value of p is different for the six
different input dataflow graphs.
[Figure 9.7 plot title: "MCMP / N = 100: Fixed p (curve) - FIS (left boxplots) - FTS (right boxplots)". Box plot labels: {FIS,R1}, {FIS,R2}, {FTS,R1}, {FTS,R2}.]

Figure 9.7: Static heating for MCMP with the local search parameter p varied in two
different ranges: the first range covers all possible values (1 to 612), while the second
range (1 to 153) is concentrated around the best fixed p value. (a) [FIS, R1], (b) [FIS, R2],
(c) [FTS, R1], (d) [FTS, R2]. The solid curve depicts the standard hybrid approach for
different values of p. Lower values of cost are better. The box plots display the static
heating results. The solid line across the box represents the median over all calculations.
The lowest cost is obtained for the standard hybrid approach with p = 39. The best static
heating scheme is (d), corresponding to FTS operating in the restricted parameter range
which includes p = 39. We note that this value of p could not be determined in advance,
and could only be found by running the standard hybrid solution for all values of p.
Heating scheme        Iterations per parameter p
type   range            1    39    77   115   153   305   457   612
FIS    [1,612]          4     X     X     X     4     4     4     4
FIS    [1,153]         33    33    33    33    33     X     X     X
FTS    [1,612]        176     X     X     X    14     4     2     2
FTS    [1,153]        175    94    42    23    14     X     X     X

Table 9.1: Iterations performed per parameter value for four different heating schemes
for MCMP. The numbers correspond to a single optimization run. For the other ten runs
they look slightly different.

to FIS with parameter set R1. Figure 9.7(b) corresponds to FIS with parameter set R2.
Figure 9.7(c) corresponds to FTS with parameter set R1. Figure 9.7(d) corresponds to
FTS with parameter set R2. The solid curves in Figure 9.7 are the results for fixed p.
Table 9.1 summarizes the iterations performed for each parameter for both FIS and FTS
with both parameter ranges.
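
The difference between the two static schemes can be made concrete with a small budget calculation. The sketch below is illustrative only: the per-generation costs are hypothetical numbers chosen to grow with p, not measurements from these experiments, and the helper names `fis_schedule` and `fts_schedule` are assumptions introduced for the example.

```python
from typing import Dict, List


def fis_schedule(params: List[int], cost: Dict[int, float],
                 t_max: float) -> Dict[int, int]:
    """FIS: the same number of iterations for every parameter value."""
    per_round = sum(cost[p] for p in params)   # one iteration at each p
    iters = int(t_max // per_round)
    return {p: iters for p in params}


def fts_schedule(params: List[int], cost: Dict[int, float],
                 t_max: float) -> Dict[int, int]:
    """FTS: the same share of the time budget for every parameter value."""
    time_slice = t_max / len(params)
    return {p: int(time_slice // cost[p]) for p in params}


# Hypothetical cost in seconds per generation for each p in R2.
R2 = [1, 39, 77, 116, 153]
cost = {1: 20.0, 39: 40.0, 77: 90.0, 116: 160.0, 153: 250.0}
t_max = 5 * 3600.0   # 5 hour budget, as in the MCMP experiments

fis = fis_schedule(R2, cost, t_max)
fts = fts_schedule(R2, cost, t_max)
```

Under FIS every parameter receives the same iteration count, so most of the budget is spent at the expensive high-p settings. Under FTS the iteration count falls as p grows, which is the qualitative pattern visible in the FTS rows of Table 9.1.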

For the voltage scaling application, we ran the static heating optimization for a run-time
of Tmax = 20 minutes. For FIS and FTS, the parameter sets used were R3 = [1, 2, 3, 4, 5]
and R4 = [2.25, 2.50, 2.75, 3.00, 3.25]. The parameter set R3 was chosen by examining
the fidelity of the period graph estimator. Recall that the PLSA parameter p is related to
the re-simulation threshold. It is observed that for p < 1 the fidelity of the estimator is
poor. For p greater than 5, with the voltage increments used, the re-simulation threshold
is so small that simulation is done almost every time. This corresponds to the highest
accuracy setting. The parameter set R4 was chosen to center around the best fixed p
values. Results for FIS and FTS on the FFT2 application using the Monte Carlo local
search are shown in Figure 9.8. The box plot in Figure 9.8(a) corresponds to FIS with
[Figure 9.8 plots. Panel (a), "FFT2 / Static / both p ranges / Monte Carlo local search": static heating (box plots), p fixed (upper curve). Panel (b), "No heating / fixed p / Monte Carlo / fft2": the 2 ranges for fixed p.]

Figure 9.8: Static heating for voltage scaling with different parameter ranges:
(a) [FIS, R3], (b) [FIS, R4], (c) [FTS, R3], (d) [FTS, R4] (shown in the four box plots)
compared with the standard hybrid method results (fixed values of p shown in the solid line).
Here the static heating schemes all perform better than the standard hybrid approach.
The first parameter range includes all values of p, while the second range is centered
around the best fixed p value. This is shown in more detail in 9.8(b).
parameter range R3. Figure 9.8(b) corresponds to FIS with parameter range R4. Figure
9.8(c) corresponds to FTS with parameter set R3. Figure 9.8(d) corresponds to FTS with
parameter range R4. The solid curves in the figure are the results for fixed p.

9.4.4  Dynamic  Heating  Schemes  for  Voltage  Scaling  and  MCMP 

We  performed  the  dynamic  heating  schemes  VIT.I  and  VIT.T  for  both  the  MCMP  and 
voltage  scaling  applications.  Recall  that  VIT  stands  for  variable  iterations  and  time  per 
parameter;  during  the  optimization  the  next  PLSA  parameter  is  taken  when,  for  a  given 
number  fstag  of  iterations  (VIT.I)  or  a  given  time  Tstag  (VIT.T),  the  quality  of  the  solution 
candidate  has  not  improved. 

For  the  MCMP  application,  the  run-time  limit  for  the  hybrid  was  set  to  Tmax  = 
5  hours  and  the  same  two  sets  of  PLSA  parameters  were  used  as  in  the  static  heating 
case.  Eleven  runs  were  performed  for  all  cases.  Results  for  dynamic  heating  on  the 
MCMP application are shown in Figure 9.9. For the voltage scaling application, the run
time was Tmax = 20 minutes. Results for voltage scaling with VIT.I and VIT.T using the
Monte Carlo local search are shown in Figure 9.10. For the dynamic heating schemes,
the  search  algorithm  operates  with  a  given  PLSA  parameter  until  the  quality  of  the  best 
solution  has  not  improved  for  either  fstag  iterations  (VIT.I)  or  Tstag  seconds  (VIT.T).  It  is 
therefore  interesting  to  observe  the  amount  of  time  spent  on  each  parameter  during  the 
optimization. This is illustrated in Figure 9.11.
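
The stagnation rule behind the dynamic schemes can be sketched as a short control loop. This is a hedged illustration of the VIT.I variant only: the function name `vit_i`, the callback `run_generation`, the use of a generation cap in place of the wall-clock limit Tmax, and the choice to stay at the last parameter once the range is exhausted are all assumptions. A VIT.T variant would compare elapsed time since the last improvement against a stagnation time instead of counting generations.

```python
from typing import Callable, List, Tuple


def vit_i(params: List[int],
          run_generation: Callable[[int], float],
          stagnation_limit: int,
          max_generations: int) -> Tuple[float, List[int]]:
    """Advance to the next p after `stagnation_limit` generations in which
    the best cost (lower is better) did not improve."""
    best = float("inf")
    used: List[int] = []        # parameter value used in each generation
    idx = 0
    stagnant = 0
    for _ in range(max_generations):
        p = params[idx]
        cost = run_generation(p)    # one EA generation with local search at p
        used.append(p)
        if cost < best:
            best = cost
            stagnant = 0
        else:
            stagnant += 1
        if stagnant >= stagnation_limit and idx < len(params) - 1:
            idx += 1                # heat up: move to the next parameter
            stagnant = 0
    return best, used
```

The list `used` records which parameter was active in each generation, so the share of the run spent at each p can be read off directly, in the spirit of Figure 9.11.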

9.4.5  Knapsack  PLSA  Run-Time  and  Accuracy 

To  test  the  binary  knapsack  problem,  we  generated  1000  pseudo-random  test  instances 
for  each  technique  as  suggested  in  [1871] .  The  weights  and  profits  in  these  instances 
were strongly correlated. The weight capacity c_i of the i-th instance is given by c_i =
[Figure 9.9 plot title: "MCMP / N = 100 : Fixed p (curves) : VIT.I (left boxplots) : VIT.T (right boxplots)". Box plot labels: {VIT.I,R1}, {VIT.I,R2}, {VIT.T,R1}, {VIT.T,R2}.]

Figure 9.9: Dynamic heating for MCMP with different parameter ranges depicted by the
four box plots: (a) [VIT.I, R1], (b) [VIT.I, R2], (c) [VIT.T, R1], (d) [VIT.T, R2]. The solid
line represents the standard hybrid technique with p fixed at different values from 1 to
612. The solid lines across the boxes represent the median over all calculations. The
lowest cost is obtained for the standard hybrid approach with p = 39. The best dynamic
heating scheme is (d), corresponding to VIT.T operating in the restricted parameter
range which includes p = 39. We note that this value of p could not be determined in
advance, and could only be found by running the standard hybrid solution for all values
of p.
[Figure 9.10 plot title: "fft2 / Dynamic / both p ranges / Monte Carlo local search". x-axis: fixed PLSA parameter p, from 1 to 5.]

Figure 9.10: Dynamic heating for voltage scaling with different parameter
ranges depicted by the four box plots: (a) [VIT.I, R3], (b) [VIT.I, R4], (c) [VIT.T, R3],
(d) [VIT.T, R4]. VIT.T refers to variable iterations and time per parameter, with the next
parameter taken if, for a given time, the solution has not improved. The solid curve
depicts results for the standard hybrid approach. All the dynamic schemes outperform
the standard hybrid (fixed p) approach, with the lowest average power obtained for (a)
VIT.I which utilizes the broader parameter range.
[Figure 9.11 plots; x-axis: PLSA parameter p. Panel (a), "MCMP / VIT.T / R1 / N = 100": parameter range R1. Panel (b): parameter range R2.]

Figure 9.11: Percent of time spent on each parameter in range R1 (a) and in range R2
(b) for VIT.T.
Figure 9.12: Local search run times vs. p for binary knapsack problem.

⌊iW/1001⌋, where W is the sum of the weights of all items. For each test instance we
compared  the  hybrid  solution  with  an  exact  solution  to  the  problem  using  the  method 
given  in  [ESO-  We  defined  an  error  sum  over  all  the  problem  instances  as  a  figure  of  merit 
for  the  hybrid  solution  technique: 


    \epsilon = \sum_{i=1}^{1000} (\alpha_i - \beta_i)        (9.6)

where \alpha_i is the profit given by the exact solution and \beta_i is the profit given by the hybrid
solution.
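
The capacity rule and the error sum can be sketched as follows. This is an illustration under assumptions: the error sum is read here as the sum over instances of the gap between the exact and hybrid profits, and the exact solver shown is a textbook dynamic program used as a stand-in, not the cited exact method. The function names are hypothetical.

```python
from typing import Sequence


def capacity(i: int, weights: Sequence[int], num_instances: int = 1000) -> int:
    """Capacity of the i-th instance: floor(i * W / (num_instances + 1))."""
    return (i * sum(weights)) // (num_instances + 1)


def exact_profit(weights: Sequence[int], profits: Sequence[int], cap: int) -> int:
    """Exact 0/1 knapsack optimum by dynamic programming over capacities."""
    best = [0] * (cap + 1)
    for w, v in zip(weights, profits):
        # Iterate capacities downward so each item is used at most once.
        for c in range(cap, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[cap]


def error_sum(exact: Sequence[int], hybrid: Sequence[int]) -> int:
    """Sum of profit gaps between exact and hybrid solutions."""
    return sum(a - b for a, b in zip(exact, hybrid))
```

Because the hybrid profit can never exceed the exact optimum, every term in the sum is non-negative, and a value of zero means the hybrid found an optimal solution on every instance.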

Figure 9.12 shows how the run-time of the pair swap PLSA increases with p. Figure
9.13 depicts the sum of errors (Equation 9.6) for the binary knapsack problem for
different values of p with the number of generations fixed at 10. We can see that higher
values of p produce smaller error, at the expense of increased run time. Thus the pair
swap PLSA satisfies the monotonicity requirement from Equation 8.1.
Figure 9.13: Standard hybrid approach for binary knapsack (fixed p, no heating) using
a  fixed  number  of  generations  and  not  fixing  overall  hybrid  run  time.  Cumulative 
error  shown  for  hybrids  utilizing  different  p.  Higher  p  is  more  accurate  but  requires 
longer  run  times. 
9.4.6 Knapsack Standard Hybrid Approach

The standard approach to hybrid global/local searches is to run the local search
at a fixed parameter. This is shown in Figure 9.14 for different values of p and for two
different run times. Here the y-axis corresponds to the sum of errors over all test cases
(Equation 9.6). We see that, for a fixed optimization run-time, the optimal value of local
search  parameter  p  using  the  standard  hybrid  approach  can  depend  on  the  run-time  and 
data  input — for  a  run  time  of  2  seconds,  the  best  value  of  p  is  2,  while  for  a  run  time  of 
5  seconds,  the  best  value  of  p  is  5.  We  note  here  and  with  the  other  applications  studied 
that  this  value  of  p  cannot  be  predicted  in  advance. 

9.4.7  Knapsack  Static  Heating  Schemes 

The static heating schemes FIS and FTS were performed for the binary knapsack problem.
Results are shown in Figure 9.15 for run times of 1 and 5 seconds, and compared
with  the  standard  hybrid  approach  for  different  values  of  p.  It  can  be  seen  that  the  static 
heating  scheme  outperformed  the  standard  hybrid  approach,  and  that  this  improvement 
is  greater  for  the  shorter  run  times. 

9.4.8  Knapsack  Dynamic  Heating  Schemes 

The dynamic heating schemes VIT.I and VIT.T were performed for the binary knapsack
application. Recall that VIT stands for variable iterations and time per parameter;
during the optimization the next PLSA parameter is taken when, for a given number
of iterations (VIT.I) or a given time (VIT.T), the quality of the solution candidate has
not improved. Figure 9.16 shows results for these dynamic schemes. Results for static
heating schemes are shown on the left for comparison. We observe that the dynamic
[Figure 9.14 plots: bars for p = 0, p = 1, p = 2, p = 5. Panel (b) title: "Binary Knapsack run time = 5, Fixed p"; y-axis scale ×10^4.]

Figure 9.14: Standard hybrid approach applied to binary knapsack for different
values of p, where p is fixed throughout. Y-axis is sum of errors. Run time is 2 seconds
in (a) and 5 seconds in (b).
[Figure 9.15 plots: bars for p = 0, p = 1, p = 2, p = 5, FTS, FIS. Panel (a) title: "Binary Knapsack run time = 1"; (a) run time 1 second, (b) run time 5 seconds.]

Figure 9.15: Static heating (2 bars on right) applied to binary knapsack compared to
the standard hybrid approach (4 bars on left). Y-axis is sum of errors over all 1000
problem instances. Run time is 1 second in 9.15(a) and 5 seconds in 9.15(b).
[Figure 9.16 plots: (a) run time 1 second, (b) run time 5 seconds.]

Figure 9.16: Dynamic heating for binary knapsack (two bars on right) compared to
static heating (two bars on left). VIT refers to variable iterations and time per parameter,
with the next parameter taken if, for a given number of iterations (VIT.I) or a given
time (VIT.T), the solution has not improved. Run time is 1 second in 9.16(a) and 5 seconds
in 9.16(b). Y-axis is cumulative error over all problem instances (note the different
y scales for the two plots).
Figure 9.17: Relationship between the value of p and the outcome of the optimization
process. 

heating  schemes  outperform  the  static  heating  schemes  significantly,  and  that  the  amount 
of  improvement  is  greater  for  shorter  run  times. 

9.5  Comparison  of  Heating  Schemes 

The results indicate that the choice of parameter p does affect the outcome of the
optimization process. For the MCMP application, there is a pronounced region for fixed p
values around p = 39 where the hybrid (with p fixed) performs best. This is illustrated
in Figure 9.4(a) (also shown as the solid curves in Figures 9.7 and 9.9). This is due
to the trade-offs in accuracy and complexity with p. For smaller values of p, a larger
number of iterations can be performed (cf. Figure 9.5). It seems that there is a point
beyond which increasing p decreases the performance of the hybrid algorithm. As
illustrated in Figure 9.17, continuously increasing p starting from p = p_min also increases
the accuracy A(p) of the PLSA and therefore the effectiveness of the overall algorithm.
However, when a certain runtime complexity C(p_opt) of the PLSA is reached, the benefit
of higher accuracy may be outweighed by the disadvantage that the number of iterations
that can be explored is smaller. As a consequence, values greater than p_opt may reduce
the overall performance as the number of iterations is too low. Figure 9.6 depicts the
performance of the hybrid with p fixed for the voltage scaling application on six different
applications. It can be seen that the region of best performance is not as pronounced
as in the MCMP application, and that this optimal value of p is different for different
applications.

The observation that certain parameter ranges appear to be more promising than
the entire range of permissible p values leads to the question of whether the heating
schemes can do better when using the reduced range. One would expect that the static
heating schemes, for which the number of iterations at each parameter is fixed beforehand,
would benefit the most from the reduced range, since the hybrid would not be
"forced" to run beyond p_opt. The dynamic heating schemes, by contrast, will continue to
operate on a given parameter as long as the quality of the solution is improving. For the
MCMP application, range R2 = [1, 39, 77, 116, 153] is centered around the best fixed
p values. Figures 9.7 through 9.10 compare the performance over the two parameter
ranges. For the static heating optimizations in Figures 9.7 and 9.8, the performance is
improved by using the reduced parameter ranges. The dynamic heating optimization in
Figure 9.9 shows a smaller relative improvement. The dynamic heating optimization
in Figure 9.10 actually shows a benefit to using the expanded parameter range. It is
important to note that in practice one would not know about the characteristics of the
different parameter ranges without first performing an optimization at each value. This
would take much longer than the simulated heating optimization itself, so in practice the
broader parameter range would probably be used. The data for fixed p for the MCMP
problem (Figures 9.4(a) and 9.4(b)) demonstrate that it can be difficult to find the optimal
p value and that this optimum may be isolated, i.e., p values close (e.g. 100) to optimum
yield  much  worse  results.  If  we  calculate  the  median  over  all  p  values  tried,  the  mean 
performance  of  the  constant  p  approach  is  worse  than  the  median  performance  of  the 
FTS  and  VIT  methods. 

Figure 9.18 compares the results of the different heating schemes for the MCMP
application with population size N = 100, 200, 50 and parameter range R1. Figure 9.19
compares  the  heating  schemes  for  the  voltage  scaling  application  on  different  graphs  for 
both  types  of  local  search. 

Comparing  the  heating  schemes  across  all  different  cases,  we  see  that  the  dynamic 
heating  schemes  performed  better  in  general  than  the  static  heating  schemes.  For  all 
cases,  the  best  heating  scheme  was  dynamic.  For  the  binary  knapsack  problem  and  the 
voltage  scaling  problem,  simulated  heating  always  outperformed  the  standard  hybrid 
approach. 

For  the  MCMP  problem,  there  was  one  PLSA  parameter  where  the  standard  hybrid 
approach  slightly  outperformed  the  dynamic,  simulated  heating  approach.  We  note  that 
in  practice,  one  would  need  to  scan  the  entire  range  of  parameters  to  find  this  optimal 
value  of  fixed  p,  which  is  in  fact  equivalent  to  allotting  much  more  time  to  this  method. 
Thus,  we  can  say  that  the  simulated  heating  approach  outperformed  the  standard  hybrid 
approach  in  the  cases  we  studied. 

9.5.1  Effect  of  Population  Size 

Figure 9.20 shows the effect of the population size for MCMP for the static heating
schemes. Figure 9.21 shows the effect of population size on the dynamic heating
schemes  for  MCMP. 

For FIS, smaller population sizes seem to be preferable. The larger number of iterations
that can be explored for N = 50 may be an explanation for the better performance.
In contrast, the heating scheme FTS achieves better results when a larger population
N = 200 is used. For the dynamic heating schemes, the results seem to be less sensitive
to the population size.

[Figure 9.18 plot title: "MCMP / N = 100 / all schemes". Box plot labels: FIS, FTS, VIT.I, VIT.T.]

Figure 9.18: Comparison of heating schemes for MCMP with N = 100. The two
box plots on left correspond to the static heating schemes. The two box plots on the
right correspond to dynamic heating schemes. The best results (lowest memory cost)
are obtained for the VIT.T dynamic heating scheme. This refers to variable iterations
and time per parameter, where the parameter is incremented if the overall solution does
not improve after a pre-determined time, called the stagnation time. The solid curve
represents the standard hybrid approach applied at different values of fixed p. The point
p = 39 slightly outperforms the VIT.T scheme.

[Figure 9.19 plots, including panel title "FFT1 Monte Carlo local search": (a) Monte Carlo local search, (b) Hill climb local search.]

Figure 9.19: Comparison of heating schemes for voltage scaling with (a) Monte Carlo
and (b) hill climb local search. The two box plots on left correspond to the FIS and
FTS static heating schemes, while the two box plots on the right correspond to dynamic
heating schemes VIT.I and VIT.T. The line across the middle of the boxes represents
the median over the runs, while the 'whisker lines' are drawn at the 10th and 90th
percentiles. The solid curve represents the standard hybrid approach applied at different
values of fixed p. In this application, all the simulated heating schemes outperformed
the standard hybrid approach. The best results were obtained for the dynamic VIT.T
scheme.

Figure 9.20: Static heating with different population sizes: 9.20(a) FIS and 9.20(b)
FTS.

Figure 9.21: Dynamic heating with different population sizes: 9.21(a) VIT.I and
9.21(b) VIT.T.

9.6  Discussion 

Several  trends  in  the  experimental  data  are  summarized  below: 

•  The dynamic variants of the simulated heating technique outperformed the standard
hybrid global/local search technique.

•  When  employing  the  standard  hybrid  method  utilizing  a  fixed  parameter  p,  an 
optimal  value  of  p  may  be  isolated  and  difficult  to  find  in  advance. 

•  Such  optimal  values  of  p  depend  on  the  application. 

•  When performing simulated heating, our experiments show that choosing the parameter
range to lie around the best fixed p values yields better results than using
the  broadest  range  in  most  cases.  However,  using  the  broader  range  still  produces 
good  results,  and  this  is  the  method  most  likely  to  be  used  in  practice. 

•  The  dynamic  heating  schemes  show  less  sensitivity  to  this  parameter  range. 

•  Overall,  the  dynamic  heating  schemes  performed  better  than  the  static  heating 
schemes. 

•  The  dynamic  heating  schemes  were  also  less  sensitive  to  the  population  size  of 
the  global  search  algorithm. 
9.7 Conclusions


Efficient local search algorithms, which refine arbitrary points in a search space into
better solutions, exist in many practical contexts. In many cases, these local search
algorithms can be parameterized so as to trade off time or space complexity for optimization
accuracy. We call these parameterized local search algorithms (PLSAs). We have shown
that a hybrid PLSA/EA (parameterized local search/evolutionary algorithm) can be very
effective for solving complex optimization problems. We have demonstrated the importance
of carefully managing the run-time/accuracy trade-offs associated with EA/PLSA
hybrid algorithms, and have introduced a novel framework of simulated heating for this
purpose. We have developed both static and dynamic trade-off management strategies
for our simulated heating framework, and have evaluated these techniques on the binary
knapsack problem and two complex, practical optimization problems with very different
structure. These problems have vast solution spaces, and underlying PLSAs that exhibit
a wide range of accuracy/complexity trade-offs. We have shown that, in the context of a
fixed optimization time budget, simulated heating better utilizes the time resources and
outperforms the standard fixed parameter hybrid methods. In addition, we have shown
that the simulated heating method is less sensitive to the parameter settings.
Chapter 10


Conclusions  and  Future  Work 


In this thesis we explored the implications of varying degrees of connectivity and
contention in multiprocessor embedded systems for digital signal processing applications.
A  trade-off  exists  in  such  systems  between  cost/complexity  and  reduced  resource  con¬ 
tention  (leading  to  higher  performance).  We  presented  techniques  for  analyzing  these 
trade-offs,  for  making  the  most  efficient  use  of  available  resources  at  a  given  design 
point,  and  for  streamlining  the  system  for  a  targeted  set  of  applications. 

The  simplest  and  cheapest  systems  utilize  a  shared  electrical  bus.  As  explained  in 
Chapter  [|.  the  shared  bus  precludes  an  analytic  expression  for  the  system  throughput, 
and simulation is required to get an accurate performance measurement. However,
simulation is computationally expensive and it is undesirable to perform repeated simulations
during an optimization. We developed a period graph model that can be used as a
computationally efficient estimator for the throughput in these systems. We demonstrated
the  utility  of  this  estimator  by  using  it  in  a  genetic  algorithm  and  a  simulated  annealing 
algorithm  for  a  voltage  scaling  application  to  reduce  power. 

With the additional expense of a hardware bus controller that imposes a global
ordering of all communications, it is possible to remove the contention that results in the
difficult analysis and to more fully optimize communication patterns in an application. This
has been demonstrated to increase the performance and to be a useful cost/performance
trade-off for several applications [11011]. For highly parallel applications with more
severe real-time constraints, however, the single bus becomes a bottleneck for
communication between processors. In Chapter 5 we introduced a system architecture that utilizes
optical fiber interconnects over multiple wavelengths. This enables multiple,
simultaneous communications and increases the system throughput. In this architecture there
is a controller for each communication wavelength, and we introduced a modification
of the TPO heuristic [E3] for determining optimal communication orderings for all the
wavelengths. We quantified the performance improvement over the single bus controller
for several applications.

A wide range of scheduling techniques for multiprocessor systems has been
developed. However, these techniques typically assume a fixed communication network
and do not systematically incorporate connectivity constraints. Connectivity constraints
may be dictated by cost, area, or power constraints. Due to the power consumption
characteristics of optical links, it is useful to restrict communication across them to low-hop
transfers. Connectivity constraints cause existing multiprocessor scheduling methods to
deadlock. In Chapter § we demonstrated a polynomial complexity algorithm for
determining the set of feasible processors that will avoid schedule deadlock in a limited-hop
schedule. We also introduced a useful metric, called communication flexibility, for the
degree to which a given scheduling decision constrains future scheduling decisions (in
the context of the given communication topology). We used this algorithm and the
flexibility metric in conjunction with a standard dynamic list scheduling algorithm to
effectively map several DSP applications across a wide range of interconnect
topologies.

In Chapter 7 we explored the problem of deriving an interconnect network for a
given application that minimizes the number of links required and maintains fanout
constraints, while also satisfying the throughput or latency requirements of the
application. This problem is important in today’s system-on-chip (SoC) designs as well as
future SoC designs that might utilize optical interconnects. We described probabilistic
and deterministic algorithms for interconnect synthesis. A key distinguishing feature
of our technique is that we perform scheduling and interconnect synthesis together,
whereas existing interconnect synthesis algorithms assume a given application mapping exists
before performing the interconnect synthesis. We demonstrated how the design space
can be greatly reduced by considering graph isomorphism, and utilized an efficient graph
isomorphism test in our deterministic algorithm.

Most optimization problems that arise in hardware-software co-design are highly
complex. The scheduling, interconnect synthesis, memory, and voltage scaling
optimization problems investigated in this thesis all involve searching vast design spaces. In
Chapter |S] we demonstrated that a hybrid PLSA/EA (parameterized local search/evolutionary
algorithm) can be very effective for solving these complex optimization problems. We
presented PLSAs for the voltage scaling, interconnect synthesis, and ordered
transactions problems.

We demonstrated the importance of carefully managing the run-time/accuracy
trade-offs associated with EA/PLSA hybrid algorithms, and introduced a novel framework of
simulated heating for this purpose. We developed both static and dynamic trade-off
management strategies for our simulated heating framework, and in Chapter evaluated
these techniques on the voltage scaling problem, a memory cost minimization problem,
and the binary knapsack problem. Simulated heating experiments with the interconnect
synthesis problem and the ordered transactions problem are two directions for future
work.


The PLSAs underlying these problems exhibit a wide range of accuracy/complexity
trade-offs. We have shown that, in the context of a fixed optimization time budget,
simulated heating better utilizes the time resources and outperforms the standard fixed
parameter hybrid methods. In addition, we have shown that the simulated heating method
is less sensitive to the parameter settings.

Appendix


.1  Random  Graph  Generation  Algorithm 

Sih’s  random  graph  generator  [ESI]  produces  graphs  with  characteristics  similar  to  those 
of  many  DSP  benchmarks.  We  made  several  modifications  to  this  algorithm  to  generate 
the  random  graphs  used  in  this  thesis. 

First, before adding a random edge, we check (using Warshall’s algorithm for
transitive closure) that the edge will not introduce a cycle in the graph. Second, we
input the number of nodes in the graph instead of the graph length. Third, we make
the maximum fanout from each node an explicit input. This controls the amount of
parallelism in the graph. Pseudo-code for the algorithm is given in Figures 1, 2, and 3.

Algorithm A.1: genRandomGraph(startNodes, numNodes, fanout, lowExecTime, highExecTime, lowIPCcost, highIPCcost)

procedure createNode(low, high)
    comment: Create and return a new node with random execution time ∈ [low, high]
    Create new node n
    n.execTime ← uniformRandom(low, high)
    return (n)

procedure connectNodes(src, snk, adjM)
    comment: Connect src node to snk node in adjacency matrix
    adjM[src][snk] ← 1

procedure extendNode(oldEndNode, endNodes, fanout)
    comment: Connect a newly created node to one of the endNodes
    r ← uniformRandom(1, fanout)
    for i ← 1 ... r
        do  m ← createNode(lowExecTime, highExecTime)
            connectNodes(oldEndNode, m)
            endNodes.delete(oldEndNode)
            endNodes.add(m)

procedure converge(nodesToConverge, endNodes)
    comment: Cause some endNodes to all converge to a single node
    p ← createNode(lowExecTime, highExecTime)
    for i ← 1 ... nodesToConverge.size()
        do  nodesToConverge.deleteHead(h)
            endNodes.delete(h)
            connectNodes(h, p)
    endNodes.add(p)

procedure diverge(endNodes, num, V, divergedNodes)
    comment: randomly chosen endnode diverges out
    for i ← 1 ... num
        do  n ← createNode(lowExecTime, highExecTime)
            connectNodes(V, n)
            divergedNodes.add(n)
    endNodes.delete(V)

Figure 1: Pseudo-code for procedures used in the random graph algorithm.

Algorithm A.1: genRandomGraph(startNodes, numNodes, fanout, lowExecTime, highExecTime, lowIPCcost, highIPCcost)

procedure divergeConverge(endNodes, V, W, L, numAdded)
    comment: Attach a structure which diverges and then converges to an endnode
    w ← uniformRandom(2, W)
    nodeList ← ∅
    diverge(endNodes, w, V, divergedNodes)
    len ← choose randomly from [0 ... L]
    for i ← 1 ... w
        do  divergedNodes.deleteHead(h)
            n ← extendNode(h, endNodes, len)
            nodeList.add(n)
    converge(nodeList, endNodes)
    numAdded ← w(len + 1) + 1

procedure pickRandomly(nodeList, n)
    comment: Create a random list of n nodes from nodeList
    while (nodeList.size() < n)
        do  p ← nodeList.firstPtr
            r ← uniformRandom(0, nodeList.size)
            for i ∈ [1 ... r]
                do  p ← p.next
            if p ∉ S
                then  S ← S ∪ {p}
                      nodeList.insert(p)

procedure randomConnection(adjM)
    comment: Add a random edge that doesn’t create a cycle
    ok ← FALSE
    while (ok = FALSE)
        do  h ← pickRandomly(allNodes, 1)
            t ← pickRandomly(allNodes, 1)
            transitiveClosure(adjM)
            if (path(h, t) = 0)
                then  ok ← TRUE
                      connectNodes(h, t, adjM)

Figure 2: Pseudo-code for the random graph algorithm (continued from Figure 1).
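
The cycle check used by randomConnection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to generate the graphs in this thesis: the function names transitive_closure and try_add_edge are hypothetical, the graph is a square 0/1 adjacency matrix, and the sketch assumes that an edge from src to snk is rejected exactly when snk can already reach src.

```python
def transitive_closure(adj):
    """Warshall's algorithm: reach[i][j] is True iff a path i -> j exists."""
    n = len(adj)
    reach = [[bool(adj[i][j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def try_add_edge(adj, src, snk):
    """Add the edge src -> snk only if the graph stays acyclic.

    Returns True if the edge was added, False if it was rejected.
    """
    if src == snk or adj[src][snk]:
        return False  # self-loop or duplicate edge
    reach = transitive_closure(adj)
    if reach[snk][src]:
        return False  # an existing path snk -> src would close a cycle
    adj[src][snk] = 1
    return True
```

Recomputing the closure before every candidate edge costs O(n^3) per attempt, which is acceptable for the small graphs typical of DSP benchmarks.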

Algorithm A.1: genRandomGraph(startNodes, numNodes, fanout, lowExecTime, highExecTime, lowIPCcost, highIPCcost)

main
    for i ← 1 ... startNodes
        do  u ← createNode(lowExecTime, highExecTime)
            endNodes.add(u)
    n ← startNodes
    while (n < numNodes)
        do  actionNumber ← uniformRandom(0, 100)
            if (actionNumber < 20)
                then  v ← pickRandomly(endNodes, 1)
                      extendNode(v, endNodes, 1)
                      n ← n + 1
            else if (actionNumber < 40)
                then  c ← uniformRandom(1, fanout)
                      convNodes ← pickRandomly(endNodes, c)
                      converge(convNodes, endNodes)
                      n ← n + 1
            else if (actionNumber < 80)
                then  d ← uniformRandom(1, fanout)
                      u ← pickRandomly(endNodes, 1)
                      diverge(endNodes, d, u, divergedNodes)
                      n ← n + d
            else if (actionNumber < 100)
                then  u ← pickRandomly(endNodes, 1)
                      d ← uniformRandom(1, fanout)
                      divergeConverge(endNodes, u, d, e, numAdded)
                      n ← n + numAdded
    for i ← 1 ... numRandomConnections
        do  randomConnection(adjM)

Figure 3: Pseudo-code for the random graph algorithm (continued from Figure 2).
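
The pseudo-code in Figures 1 through 3 can be approximated by the compact Python sketch below. It is a hedged reading of the figures, not the generator actually used for the experiments, and it makes several assumptions that are not in the original: the function name gen_random_graph and the parameters max_len and seed are hypothetical, extend appends a chain of new nodes, diverged nodes are treated as end nodes, the cycle test uses a depth-first reachability search in place of Warshall's transitive closure, and a rejected random edge is skipped and not retried. IPC costs are omitted, and random connections may raise a node's fanout above the fanout parameter.

```python
import random

def gen_random_graph(start_nodes, num_nodes, fanout, low_exec, high_exec,
                     max_len=2, num_random_connections=0, seed=None):
    """Return (exec_time, succ) for a random DAG with node ids 0..len-1."""
    rng = random.Random(seed)
    exec_time = []    # exec_time[v]: execution time of node v
    succ = []         # succ[v]: successors of node v
    end_nodes = set() # nodes that currently have no successors

    def create_node():
        exec_time.append(rng.randint(low_exec, high_exec))
        succ.append([])
        return len(succ) - 1

    def extend(v, length):
        # Assumption: append a chain of `length` new nodes below end node v.
        end_nodes.discard(v)
        for _ in range(length):
            m = create_node()
            succ[v].append(m)
            v = m
        end_nodes.add(v)
        return v

    def converge(nodes):
        p = create_node()
        for h in nodes:
            end_nodes.discard(h)
            succ[h].append(p)
        end_nodes.add(p)

    def diverge(v, num):
        end_nodes.discard(v)
        new = [create_node() for _ in range(num)]
        succ[v].extend(new)
        end_nodes.update(new)  # assumption: diverged nodes become end nodes
        return new

    def reaches(a, b):
        # Depth-first search: is there a path a -> b?
        stack, seen = [a], set()
        while stack:
            x = stack.pop()
            if x == b:
                return True
            if x not in seen:
                seen.add(x)
                stack.extend(succ[x])
        return False

    def pick(k):
        return rng.sample(sorted(end_nodes), min(k, len(end_nodes)))

    for _ in range(start_nodes):
        end_nodes.add(create_node())

    while len(succ) < num_nodes:
        action = rng.randrange(100)
        if action < 20:
            extend(pick(1)[0], 1)
        elif action < 40:
            converge(pick(rng.randint(1, fanout)))
        elif action < 80:
            diverge(pick(1)[0], rng.randint(1, fanout))
        else:
            w = rng.randint(min(2, fanout), fanout)
            length = rng.randint(0, max_len)
            tails = [extend(n, length) for n in diverge(pick(1)[0], w)]
            converge(tails)

    for _ in range(num_random_connections):
        h, t = rng.sample(range(len(succ)), 2)
        # Skip pairs that would duplicate an edge or close a cycle.
        if t not in succ[h] and not reaches(t, h):
            succ[h].append(t)
    return exec_time, succ
```

As in the pseudo-code, the main loop stops once at least num_nodes nodes exist, so the final graph can contain slightly more nodes than requested. A call such as gen_random_graph(2, 30, 3, 1, 10, num_random_connections=5, seed=1) returns the execution times and successor lists of one such graph.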